Hypothesis

Postnatal environmental exposures, particularly those found in household products and dietary intake, along with specific serum metabolomics profiles, are significantly associated with the BMI Z-score of children aged 6-11 years. Higher concentrations of certain metabolites in serum, reflecting exposure to chemical classes or metals, will correlate with variations in BMI Z-score, controlling for age and other relevant covariates. Some metabolites associated with chemical exposures and dietary patterns can serve as biomarkers for the risk of developing obesity.
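As a minimal sketch of what "controlling for age and other relevant covariates" could look like in practice, the block below fits an adjusted linear model on simulated data. The data frame `toy`, its column names, and the effect sizes are illustrative assumptions; they are not HELIX variables or results.

```r
# Simulated stand-in data; column names and coefficients are assumptions.
set.seed(1)
n <- 200
toy <- data.frame(
  exposure_log2 = rnorm(n),                 # hypothetical log2 exposure
  age           = runif(n, 6, 11),          # child age in years
  sex           = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Outcome built with a known exposure effect of 0.3 plus noise.
toy$zbmi <- 0.3 * toy$exposure_log2 + 0.05 * toy$age + rnorm(n, sd = 0.5)

# Exposure-outcome association adjusted for age and sex.
fit      <- lm(zbmi ~ exposure_log2 + age + sex, data = toy)
adj_beta <- coef(fit)[["exposure_log2"]]
adj_ci   <- confint(fit)["exposure_log2", ]
```

Here `adj_beta` is the estimated change in the outcome per one-unit increase in the exposure with age and sex held fixed, and `adj_ci` is its 95% confidence interval.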

Background

Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and ingested through the diet, can interfere with hormonal signaling. This interference has been linked to increased body mass index (BMI) in children, suggesting a potential pathway through which exposure contributes to the development of obesity.

A longitudinal study on Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development. These chemicals can mimic or disrupt hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.

Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and body composition in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not in boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with adiposity in both sexes, highlighting how the timing of exposure shapes its effect on childhood development.

The 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to a mixture of chemicals rather than to individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can lead to additive outcomes (where the effects simply add up) or even synergistic outcomes (where the combined effect is greater than the sum of the separate effects). These interactions can amplify the risk of obesity and metabolic disorders in children. The dose-response findings indicated that even low-level exposure to multiple EDCs could result in significant health impacts due to their combined effects.
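The distinction between additive and synergistic effects can be illustrated with a linear model that includes an interaction term. This is only a sketch on a noise-free toy grid: the exposures `a` and `b`, the outcome `y`, and all coefficients are hypothetical and are not taken from the cited study.

```r
# Noise-free toy outcome over a grid of two hypothetical exposures a and b.
grid <- expand.grid(a = 0:3, b = 0:3)
grid$y <- 1 + 2 * grid$a + 3 * grid$b + 1.5 * grid$a * grid$b

# a * b expands to a + b + a:b; the a:b coefficient measures
# departure from additivity on the scale of y.
fit_int <- lm(y ~ a * b, data = grid)
interaction_beta <- unname(coef(fit_int)["a:b"])
```

An `a:b` coefficient of zero would correspond to purely additive effects on this scale, while a positive coefficient means the joint effect exceeds the sum of the separate effects.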

These studies collectively illustrate the critical role of environmental EDCs in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.

Data Description

This study will use data from the subcohort of 1301 mother-child pairs in the HELIX study, with children aged 6-11 years, for whom complete exposure and outcome data were available. Exposure data include detailed postnatal dietary records and concentrations of chemicals such as BPA and PCBs measured in child biological samples. The dataset contains both categorical and numerical variables, covering demographic details and biochemical measurements. It allows statistical analysis of potential associations between EDC exposure and BMI Z-scores while accounting for confounding factors such as age, sex, and socioeconomic status. There are no missing data, so no imputation is needed. Child BMI Z-scores were calculated using the WHO growth reference.
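The statement that no values are missing can be checked directly before modeling. The sketch below uses a small made-up data frame (`toy`, with one deliberately missing value) purely to show the calls; the same calls could be pointed at the analysis data.

```r
# Toy data frame with one deliberately missing value (names are hypothetical).
toy <- data.frame(
  zbmi = c(0.2, -1.1, NA, 0.8),
  age  = c(6.5, 8.0, 9.2, 10.4)
)

missing_per_column <- colSums(is.na(toy))    # NA count for each column
any_missing        <- anyNA(toy)             # TRUE if any value is missing
complete_rows      <- sum(complete.cases(toy))  # rows with no NA at all
```

For a dataset with no missing values, `missing_per_column` would be all zeros and `complete_rows` would equal the number of rows.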

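# Packages used by the code in this report
library(dplyr)        # filter(), select(), bind_rows(), %>%
library(tidyr)        # pivot_longer()
library(ggplot2)      # histograms and bar charts
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
library(summarytools) # dfSummary()
library(corrplot)     # corrplot()

# Load the HELIX data objects used below (e.g., codebook)
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
# Postnatal chemical and lifestyle exposures, excluding allergens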
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("e3_sex_None", "e3_yearbir_None", "h_edumc_None", "h_cohort", "hs_child_age_None"))

#specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
variable_name domain family subfamily period location period_postnatal description var_type transformation labels labelsshort
h_bfdur_Ter h_bfdur_Ter Lifestyles Lifestyle Diet Postnatal NA NA Breastfeeding duration (weeks) factor Tertiles Breastfeeding Breastfeeding
hs_bakery_prod_Ter hs_bakery_prod_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bakery products (hs_cookies + hs_pastries) factor Tertiles Bakery prod BakeProd
hs_beverages_Ter hs_beverages_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: beverages (hs_dietsoda+hs_soda) factor Tertiles Soda Soda
hs_break_cer_Ter hs_break_cer_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: breakfast cereal (hs_sugarcer+hs_othcer) factor Tertiles BF cereals BFcereals
hs_caff_drink_Ter hs_caff_drink_Ter Lifestyles Lifestyle Diet Postnatal NA NA Drinks a caffeinated or energy drink (eg coca-cola, diet-coke, redbull) factor Tertiles Caffeine Caffeine
hs_dairy_Ter hs_dairy_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) factor Tertiles Dairy Dairy
hs_fastfood_Ter hs_fastfood_Ter Lifestyles Lifestyle Diet Postnatal NA NA Visits a fast food restaurant/take away factor Tertiles Fastfood Fastfood
hs_KIDMED_None hs_KIDMED_None Lifestyles Lifestyle Diet Postnatal NA NA Sum of KIDMED indices, without index9 numeric None KIDMED KIDMED
hs_mvpa_prd_alt_None hs_mvpa_prd_alt_None Lifestyles Lifestyle Physical activity Postnatal NA NA Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) numeric None PA PA
hs_org_food_Ter hs_org_food_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats organic food factor Tertiles Organicfood Organicfood
hs_proc_meat_Ter hs_proc_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: processed meat (hs_coldmeat+hs_ham) factor Tertiles Processed meat ProcMeat
hs_readymade_Ter hs_readymade_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats a ready-made supermarket meal factor Tertiles Ready made food ReadyFood
hs_sd_wk_None hs_sd_wk_None Lifestyles Lifestyle Physical activity Postnatal NA NA sedentary behaviour (min/day) numeric None Sedentary Sedentary
hs_total_bread_Ter hs_total_bread_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bread (hs_darkbread+hs_whbread) factor Tertiles Bread Bread
hs_total_cereal_Ter hs_total_cereal_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) factor Tertiles Cereals Cereals
hs_total_fish_Ter hs_total_fish_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) factor Tertiles Fish Fish
hs_total_fruits_Ter hs_total_fruits_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) factor Tertiles Fruits Fruits
hs_total_lipids_Ter hs_total_lipids_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: Added fat factor Tertiles Diet fat Diet fat
hs_total_meat_Ter hs_total_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) factor Tertiles Meat Meat
hs_total_potatoes_Ter hs_total_potatoes_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: potatoes (hs_frenchfries+hs_potatoes) factor Tertiles Potatoes Potatoes
hs_total_sweets_Ter hs_total_sweets_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: sweets (hs_choco + hs_sweets + hs_sugar) factor Tertiles Sweets Sweets
hs_total_veg_Ter hs_total_veg_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: vegetables (hs_cookveg+hs_rawveg) factor Tertiles Vegetables Vegetables
hs_total_yog_Ter hs_total_yog_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: yogurt (hs_yogurt+hs_probiotic) factor Tertiles Yogurt Yogurt
hs_dif_hours_total_None hs_dif_hours_total_None Lifestyles Lifestyle Sleep Postnatal NA NA Total hours of sleep (mean weekdays and night) numeric None Sleep Sleep
hs_as_c_Log2 hs_as_c_Log2 Chemicals Metals As Postnatal NA NA Arsenic (As) in child numeric Logarithm base 2 As As
hs_cd_c_Log2 hs_cd_c_Log2 Chemicals Metals Cd Postnatal NA NA Cadmium (Cd) in child numeric Logarithm base 2 Cd Cd
hs_co_c_Log2 hs_co_c_Log2 Chemicals Metals Co Postnatal NA NA Cobalt (Co) in child numeric Logarithm base 2 Co Co
hs_cs_c_Log2 hs_cs_c_Log2 Chemicals Metals Cs Postnatal NA NA Caesium (Cs) in child numeric Logarithm base 2 Cs Cs
hs_cu_c_Log2 hs_cu_c_Log2 Chemicals Metals Cu Postnatal NA NA Copper (Cu) in child numeric Logarithm base 2 Cu Cu
hs_hg_c_Log2 hs_hg_c_Log2 Chemicals Metals Hg Postnatal NA NA Mercury (Hg) in child numeric Logarithm base 2 Hg Hg
hs_mn_c_Log2 hs_mn_c_Log2 Chemicals Metals Mn Postnatal NA NA Manganese (Mn) in child numeric Logarithm base 2 Mn Mn
hs_mo_c_Log2 hs_mo_c_Log2 Chemicals Metals Mo Postnatal NA NA Molybdenum (Mo) in child numeric Logarithm base 2 Mo Mo
hs_pb_c_Log2 hs_pb_c_Log2 Chemicals Metals Pb Postnatal NA NA Lead (Pb) in child numeric Logarithm base 2 Pb Pb
hs_tl_cdich_None hs_tl_cdich_None Chemicals Metals Tl Postnatal NA NA Dichotomous variable of thallium (Tl) in child factor None Tl Tl
hs_dde_cadj_Log2 hs_dde_cadj_Log2 Chemicals Organochlorines DDE Postnatal NA NA Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids numeric Logarithm base 2 DDE DDE
hs_ddt_cadj_Log2 hs_ddt_cadj_Log2 Chemicals Organochlorines DDT Postnatal NA NA Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids numeric Logarithm base 2 DDT DDT
hs_hcb_cadj_Log2 hs_hcb_cadj_Log2 Chemicals Organochlorines HCB Postnatal NA NA Hexachlorobenzene (HCB) in child adjusted for lipids numeric Logarithm base 2 HCB HCB
hs_pcb118_cadj_Log2 hs_pcb118_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids numeric Logarithm base 2 PCB 118 PCB118
hs_pcb138_cadj_Log2 hs_pcb138_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids numeric Logarithm base 2 PCB 138 PCB138
hs_pcb153_cadj_Log2 hs_pcb153_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids numeric Logarithm base 2 PCB 153 PCB153
hs_pcb170_cadj_Log2 hs_pcb170_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids numeric Logarithm base 2 PCB 170 PCB170
hs_pcb180_cadj_Log2 hs_pcb180_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids numeric Logarithm base 2 PCB 180 PCB180
hs_sumPCBs5_cadj_Log2 hs_sumPCBs5_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Sum of PCBs in child adjusted for lipids (4 cohorts) numeric Logarithm base 2 PCBs SumPCB
hs_dep_cadj_Log2 hs_dep_cadj_Log2 Chemicals Organophosphate pesticides DEP Postnatal NA NA Diethyl phosphate (DEP) in child adjusted for creatinine numeric Logarithm base 2 DEP DEP
hs_detp_cadj_Log2 hs_detp_cadj_Log2 Chemicals Organophosphate pesticides DETP Postnatal NA NA Diethyl thiophosphate (DETP) in child adjusted for creatinine numeric Logarithm base 2 DETP DETP
hs_dmdtp_cdich_None hs_dmdtp_cdich_None Chemicals Organophosphate pesticides DMDTP Postnatal NA NA Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child factor None DMDTP DMDTP
hs_dmp_cadj_Log2 hs_dmp_cadj_Log2 Chemicals Organophosphate pesticides DMP Postnatal NA NA Dimethyl phosphate (DMP) in child adjusted for creatinine numeric Logarithm base 2 DMP DMP
hs_dmtp_cadj_Log2 hs_dmtp_cadj_Log2 Chemicals Organophosphate pesticides DMTP Postnatal NA NA Dimethyl thiophosphate (DMTP) in child adjusted for creatinine numeric Logarithm base 2 DMDTP DMTP
hs_pbde153_cadj_Log2 hs_pbde153_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE153 Postnatal NA NA Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids numeric Logarithm base 2 PBDE 153 PBDE153
hs_pbde47_cadj_Log2 hs_pbde47_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE47 Postnatal NA NA Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids numeric Logarithm base 2 PBDE 47 PBDE47
hs_pfhxs_c_Log2 hs_pfhxs_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFHXS Postnatal NA NA Perfluorohexane sulfonate (PFHXS) in child numeric Logarithm base 2 PFHXS PFHXS
hs_pfna_c_Log2 hs_pfna_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFNA Postnatal NA NA Perfluorononanoate (PFNA) in child numeric Logarithm base 2 PFNA PFNA
hs_pfoa_c_Log2 hs_pfoa_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOA Postnatal NA NA Perfluorooctanoate (PFOA) in child numeric Logarithm base 2 PFOA PFOA
hs_pfos_c_Log2 hs_pfos_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOS Postnatal NA NA Perfluorooctane sulfonate (PFOS) in child numeric Logarithm base 2 PFOS PFOS
hs_pfunda_c_Log2 hs_pfunda_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFUNDA Postnatal NA NA Perfluoroundecanoate (PFUNDA) in child numeric Logarithm base 2 PFUNDA PFUNDA
hs_bpa_cadj_Log2 hs_bpa_cadj_Log2 Chemicals Phenols BPA Postnatal NA NA Bisphenol A (BPA) in child adjusted for creatinine numeric Logarithm base 2 BPA BPA
hs_bupa_cadj_Log2 hs_bupa_cadj_Log2 Chemicals Phenols BUPA Postnatal NA NA N-Butyl paraben (BUPA) in child adjusted for creatinine numeric Logarithm base 2 BUPA BUPA
hs_etpa_cadj_Log2 hs_etpa_cadj_Log2 Chemicals Phenols ETPA Postnatal NA NA Ethyl paraben (ETPA) in child adjusted for creatinine numeric Logarithm base 2 ETPA ETPA
hs_mepa_cadj_Log2 hs_mepa_cadj_Log2 Chemicals Phenols MEPA Postnatal NA NA Methyl paraben (MEPA) in child adjusted for creatinine numeric Logarithm base 2 MEPA MEPA
hs_oxbe_cadj_Log2 hs_oxbe_cadj_Log2 Chemicals Phenols OXBE Postnatal NA NA Oxybenzone (OXBE) in child adjusted for creatinine numeric Logarithm base 2 OXBE OXBE
hs_prpa_cadj_Log2 hs_prpa_cadj_Log2 Chemicals Phenols PRPA Postnatal NA NA Propyl paraben (PRPA) in child adjusted for creatinine numeric Logarithm base 2 PRPA PRPA
hs_trcs_cadj_Log2 hs_trcs_cadj_Log2 Chemicals Phenols TRCS Postnatal NA NA Triclosan (TRCS) in child adjusted for creatinine numeric Logarithm base 2 TRCS TRCS
hs_mbzp_cadj_Log2 hs_mbzp_cadj_Log2 Chemicals Phthalates MBZP Postnatal NA NA Mono benzyl phthalate (MBzP) in child adjusted for creatinine numeric Logarithm base 2 MBZP MBZP
hs_mecpp_cadj_Log2 hs_mecpp_cadj_Log2 Chemicals Phthalates MECPP Postnatal NA NA Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine numeric Logarithm base 2 MECPP MECPP
hs_mehhp_cadj_Log2 hs_mehhp_cadj_Log2 Chemicals Phthalates MEHHP Postnatal NA NA Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine numeric Logarithm base 2 MEHHP MEHHP
hs_mehp_cadj_Log2 hs_mehp_cadj_Log2 Chemicals Phthalates MEHP Postnatal NA NA Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine numeric Logarithm base 2 MEHP MEHP
hs_meohp_cadj_Log2 hs_meohp_cadj_Log2 Chemicals Phthalates MEOHP Postnatal NA NA Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine numeric Logarithm base 2 MEOHP MEOHP
hs_mep_cadj_Log2 hs_mep_cadj_Log2 Chemicals Phthalates MEP Postnatal NA NA Monoethyl phthalate (MEP) in child adjusted for creatinine numeric Logarithm base 2 MEP MEP
hs_mibp_cadj_Log2 hs_mibp_cadj_Log2 Chemicals Phthalates MIBP Postnatal NA NA Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine numeric Logarithm base 2 MIBP MIBP
hs_mnbp_cadj_Log2 hs_mnbp_cadj_Log2 Chemicals Phthalates MNBP Postnatal NA NA Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine numeric Logarithm base 2 MNBP MNBP
hs_ohminp_cadj_Log2 hs_ohminp_cadj_Log2 Chemicals Phthalates OHMiNP Postnatal NA NA Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine numeric Logarithm base 2 OHMiNP OHMiNP
hs_oxominp_cadj_Log2 hs_oxominp_cadj_Log2 Chemicals Phthalates OXOMINP Postnatal NA NA Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine numeric Logarithm base 2 OXOMINP OXOMINP
hs_sumDEHP_cadj_Log2 hs_sumDEHP_cadj_Log2 Chemicals Phthalates DEHP Postnatal NA NA Sum of DEHP metabolites (µg/g) in child adjusted for creatinine numeric Logarithm base 2 DEHP SumDEHP
FAS_cat_None FAS_cat_None Chemicals Social and economic capital Economic capital Postnatal NA NA Family affluence score factor None Family affluence FamAfl
hs_contactfam_3cat_num_None hs_contactfam_3cat_num_None Chemicals Social and economic capital Social capital Postnatal NA NA scoial capital: family friends factor None Social contact SocCont
hs_hm_pers_None hs_hm_pers_None Chemicals Social and economic capital Social capital Postnatal NA NA How many people live in your home? numeric None House crowding HouseCrow
hs_participation_3cat_None hs_participation_3cat_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: structural factor None Social participation SocPartic
hs_cotinine_cdich_None hs_cotinine_cdich_None Chemicals Tobacco Smoke Cotinine Postnatal NA NA Dichotomous variable of cotinine in child factor None Cotinine Cotinine
hs_globalexp2_None hs_globalexp2_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Global exposure of the child to ETS (2 categories) factor None ETS ETS
hs_smk_parents_None hs_smk_parents_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Tobacco Smoke status of parents (both) factor None Smoking_parents SmokPar
e3_sex_None e3_sex_None Covariates Covariates Child covariate Pregnancy NA NA Child sex (female / male) factor None Child sex Sex
e3_yearbir_None e3_yearbir_None Covariates Covariates Child covariate Pregnancy NA NA Year of birth (2003 to 2009) factor None Year of birth YearBirth
h_cohort h_cohort Covariates Covariates Maternal covariate Pregnancy NA NA Cohort of inclusion (1 to 6) factor None Cohort Cohort
h_edumc_None h_edumc_None Covariates Covariates Maternal covariate Pregnancy NA NA Maternal education (1: primary school, 2:secondary school, 3:university degree or higher) factor None Maternal education mEducation
hs_child_age_None hs_child_age_None Covariates Covariates Child covariate Postnatal NA NA Child age at examination (years) numeric None Child age cAge
hs_zbmi_who hs_zbmi_who Phenotype Phenotype Outcome at 6-11 years old Postnatal NA NA Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age numeric None Body mass index z-score zBMI
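The transformation column in the codebook lists "Logarithm base 2" and "Tertiles". The block below is a small sketch of what these two transformations do, using made-up numbers; the vectors `conc` and `intake` are hypothetical, and the cut points actually used for the HELIX variables may have been defined differently.

```r
# Hypothetical raw concentrations and intake frequencies (not HELIX values).
conc   <- c(0.5, 1, 2, 4, 8, 16)
intake <- c(0.5, 1, 2, 3, 5, 7, 9, 12, 14)

# Log base 2: a one-unit increase corresponds to a doubling of concentration.
conc_log2 <- log2(conc)

# Tertiles: cut at the 1/3 and 2/3 quantiles, keeping the lowest value in bin 1.
breaks     <- quantile(intake, probs = c(0, 1/3, 2/3, 1))
intake_ter <- cut(intake, breaks = breaks, include.lowest = TRUE)
```

On the log2 scale, regression coefficients are read as the change in outcome per doubling of the exposure, and values below 1 on the original scale become negative.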

Data Summary for Exposures, Covariates, and Outcome

Data Summary Exposures: Lifestyles

Lifestyle_Exposures <- combined_codebook$variable_name[combined_codebook$domain=="Lifestyles"]
lifestyle_exposome <- dplyr::select(exposome, all_of(Lifestyle_Exposures))
summarytools::view(dfSummary(lifestyle_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No | Variable | Stats / Values | Freqs (% of Valid) | Missing
1 | h_bfdur_Ter [factor] | 1. (0,10.8]; 2. (10.8,34.9]; 3. (34.9,Inf] | 506 (38.9%); 270 (20.8%); 525 (40.4%) | 0 (0.0%)
2 | hs_bakery_prod_Ter [factor] | 1. (0,2]; 2. (2,6]; 3. (6,Inf] | 345 (26.5%); 423 (32.5%); 533 (41.0%) | 0 (0.0%)
3 | hs_beverages_Ter [factor] | 1. (0,0.132]; 2. (0.132,1]; 3. (1,Inf] | 331 (25.4%); 454 (34.9%); 516 (39.7%) | 0 (0.0%)
4 | hs_break_cer_Ter [factor] | 1. (0,1.1]; 2. (1.1,5.5]; 3. (5.5,Inf] | 291 (22.4%); 521 (40.0%); 489 (37.6%) | 0 (0.0%)
5 | hs_caff_drink_Ter [factor] | 1. (0,0.132]; 2. (0.132,Inf] | 808 (62.1%); 493 (37.9%) | 0 (0.0%)
6 | hs_dairy_Ter [factor] | 1. (0,14.6]; 2. (14.6,25.6]; 3. (25.6,Inf] | 359 (27.6%); 465 (35.7%); 477 (36.7%) | 0 (0.0%)
7 | hs_fastfood_Ter [factor] | 1. (0,0.132]; 2. (0.132,0.5]; 3. (0.5,Inf] | 143 (11.0%); 603 (46.3%); 555 (42.7%) | 0 (0.0%)
8 | hs_KIDMED_None [numeric] | Mean (sd): 2.9 (1.8); min ≤ med ≤ max: -3 ≤ 3 ≤ 9; IQR (CV): 2 (0.6) | 13 distinct values | 0 (0.0%)
9 | hs_mvpa_prd_alt_None [numeric] | Mean (sd): 37.9 (23.1); min ≤ med ≤ max: -27.8 ≤ 34.7 ≤ 146.8; IQR (CV): 24.5 (0.6) | 847 distinct values | 0 (0.0%)
10 | hs_org_food_Ter [factor] | 1. (0,0.132]; 2. (0.132,1]; 3. (1,Inf] | 429 (33.0%); 396 (30.4%); 476 (36.6%) | 0 (0.0%)
11 | hs_proc_meat_Ter [factor] | 1. (0,1.5]; 2. (1.5,4]; 3. (4,Inf] | 366 (28.1%); 471 (36.2%); 464 (35.7%) | 0 (0.0%)
12 | hs_readymade_Ter [factor] | 1. (0,0.132]; 2. (0.132,0.5]; 3. (0.5,Inf] | 327 (25.1%); 296 (22.8%); 678 (52.1%) | 0 (0.0%)
13 | hs_sd_wk_None [numeric] | Mean (sd): 235.8 (126.7); min ≤ med ≤ max: 3.1 ≤ 210 ≤ 994.3; IQR (CV): 127.1 (0.5) | 368 distinct values | 0 (0.0%)
14 | hs_total_bread_Ter [factor] | 1. (0,7]; 2. (7,17.5]; 3. (17.5,Inf] | 431 (33.1%); 381 (29.3%); 489 (37.6%) | 0 (0.0%)
15 | hs_total_cereal_Ter [factor] | 1. (0,14.1]; 2. (14.1,23.6]; 3. (23.6,Inf] | 418 (32.1%); 442 (34.0%); 441 (33.9%) | 0 (0.0%)
16 | hs_total_fish_Ter [factor] | 1. (0,1.5]; 2. (1.5,3]; 3. (3,Inf] | 389 (29.9%); 454 (34.9%); 458 (35.2%) | 0 (0.0%)
17 | hs_total_fruits_Ter [factor] | 1. (0,7]; 2. (7,14.1]; 3. (14.1,Inf] | 413 (31.7%); 407 (31.3%); 481 (37.0%) | 0 (0.0%)
18 | hs_total_lipids_Ter [factor] | 1. (0,3]; 2. (3,7]; 3. (7,Inf] | 397 (30.5%); 403 (31.0%); 501 (38.5%) | 0 (0.0%)
19 | hs_total_meat_Ter [factor] | 1. (0,6]; 2. (6,9]; 3. (9,Inf] | 425 (32.7%); 411 (31.6%); 465 (35.7%) | 0 (0.0%)
20 | hs_total_potatoes_Ter [factor] | 1. (0,3]; 2. (3,4]; 3. (4,Inf] | 417 (32.1%); 405 (31.1%); 479 (36.8%) | 0 (0.0%)
21 | hs_total_sweets_Ter [factor] | 1. (0,4.1]; 2. (4.1,8.5]; 3. (8.5,Inf] | 344 (26.4%); 516 (39.7%); 441 (33.9%) | 0 (0.0%)
22 | hs_total_veg_Ter [factor] | 1. (0,6]; 2. (6,8.5]; 3. (8.5,Inf] | 404 (31.1%); 314 (24.1%); 583 (44.8%) | 0 (0.0%)
23 | hs_total_yog_Ter [factor] | 1. (0,6]; 2. (6,8.5]; 3. (8.5,Inf] | 779 (59.9%); 308 (23.7%); 214 (16.4%) | 0 (0.0%)
24 | hs_dif_hours_total_None [numeric] | Mean (sd): 10.3 (0.7); min ≤ med ≤ max: 7.9 ≤ 10.3 ≤ 12.9; IQR (CV): 0.9 (0.1) | 437 distinct values | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-06-24

#separate numeric and categorical data
numeric_lifestyle <- lifestyle_exposome %>% 
  dplyr::select(where(is.numeric))

numeric_lifestyle_long <- pivot_longer(
  numeric_lifestyle,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_lifestyle_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_lifestyle_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

The histogram of the KIDMED score (sum of KIDMED indices, without index 9), which assesses adherence to the Mediterranean diet, displays several peaks, read from the plot at roughly 0, 3, 5, and 7; the observed range is -3 to 9, so no score of 10 occurs. The score is discrete and takes only 13 distinct integer values, so a 30-bin histogram necessarily leaves empty bins between neighboring scores. Part of the apparent multimodality may therefore be a binning artifact rather than evidence of distinct subgroups. Any clustering of dietary behavior among the sampled children should be confirmed with one bar per score before it is interpreted.
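The effect of bin count on an integer-valued score can be sketched in base R. The vector `scores` below is made up to span -3 to 9 with a single central peak; it is not the HELIX KIDMED data.

```r
# Hypothetical integer scores spanning -3..9 (13 distinct values).
scores <- rep(-3:9, times = c(1, 2, 4, 7, 9, 12, 14, 12, 9, 7, 4, 2, 1))

# 30 equal-width bins over 13 integers: most bins cannot contain any value.
h30 <- hist(scores,
            breaks = seq(min(scores), max(scores), length.out = 31),
            plot = FALSE)
empty_bins <- sum(h30$counts == 0)

# One bin per integer (width 1, centered on each score) shows the real shape.
h1 <- hist(scores, breaks = seq(-3.5, 9.5, by = 1), plot = FALSE)
```

Even though `scores` is unimodal by construction, the 30-bin version alternates between filled and empty bins. In ggplot2, `geom_histogram(binwidth = 1)` or `geom_bar()` gives the one-bar-per-score view for a discrete variable.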

The second histogram shows moderate-to-vigorous physical activity (MVPA) in minutes per day, cleaned and adjusted for over-reporting. The distribution is right-skewed (mean 37.9, median 34.7, maximum 146.8): most children have low to moderate MVPA, while a small number have very high values that may reflect residual over-reporting or measurement error. The minimum of -27.8 min/day is not a physically possible duration, which suggests that the variable has been adjusted or modeled rather than directly recorded.

For sedentary behavior in minutes per day, the summary statistics indicate a right-skewed distribution: the mean (235.8) exceeds the median (210), and the maximum (994.3) lies far above both. Most children cluster at low to moderate sedentary times, with a long upper tail formed by a minority who report very long sedentary periods. It is this upper tail, rather than the bulk of the sample, that raises concern about lifestyle habits involving prolonged inactivity.

The distribution of total hours of sleep per night (averaged over weekdays and weekends) is approximately normal, with the mean and median both at 10.3 hours and a standard deviation of 0.7. Values range from 7.9 to 12.9 hours, so most children have similar sleep durations clustered around the mean, without pronounced extremes of insufficient or excessive sleep.
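Skew direction can be checked numerically instead of being read from a histogram. The sketch below defines a simple moment-based skewness function (`sample_skewness` is a helper written for this illustration, not a function from a package) and applies it to made-up values.

```r
# Moment-based sample skewness; positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2))^1.5
}

# Hypothetical minutes-per-day values with one very large observation.
minutes <- c(60, 120, 150, 180, 200, 210, 230, 260, 300, 990)

skew         <- sample_skewness(minutes)
mean_gt_med  <- mean(minutes) > median(minutes)  # quick screen for right skew
```

Comparing the mean with the median is only a heuristic, but it usually agrees with the sign of the skewness when the tail is as pronounced as in this example.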

categorical_lifestyle <- lifestyle_exposome %>% 
  dplyr::select(where(is.factor))

categorical_lifestyle_long <- pivot_longer(
  categorical_lifestyle,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_lifestyle_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_lifestyle_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Breastfeeding Duration: The lowest (38.9%) and highest (40.4%) duration categories are almost equally common, with fewer children (20.8%) in the middle category.

Bakery Products: Counts rise across the categories (26.5%, 32.5%, 41.0%), so the highest consumption category is the most common.

Beverages: The highest category is the most common (39.7%), ahead of the middle (34.9%) and lowest (25.4%), indicating relatively frequent soda consumption.

Breakfast Cereal: The middle category is the most common (40.0%), followed closely by the highest (37.6%); the lowest category is the smallest (22.4%).

Caffeinated/Energy Drinks: This variable has only two categories. Most participants (62.1%) fall in the lower one, indicating little or no consumption.

Dairy: The middle (35.7%) and highest (36.7%) categories are similar in size, with somewhat fewer participants in the lowest (27.6%).

Fast Food: The middle category is the most common (46.3%), closely followed by the highest (42.7%); only 11.0% of participants are in the lowest category.

Organic Food: The three categories are fairly balanced (33.0%, 30.4%, 36.6%), with the highest slightly ahead.

Processed Meat: Consumption is fairly evenly distributed (28.1%, 36.2%, 35.7%), indicating varied dietary habits regarding processed meats.

Ready-Made Meals: More than half of participants (52.1%) are in the highest consumption category; the rest are split between the lowest (25.1%) and middle (22.8%) categories.

Bread: The distribution is fairly balanced (33.1%, 29.3%, 37.6%), with a modest lean towards the highest category.

Cereal: The distribution is nearly uniform (32.1%, 34.0%, 33.9%), suggesting varied cereal consumption habits.

Fish and Seafood: The distribution is fairly even (29.9%, 34.9%, 35.2%), indicating varied consumption of fish and seafood.

Fruits: The highest category is the most common (37.0%); the lowest (31.7%) and middle (31.3%) categories are nearly equal.

Added Fats: The highest category is the most common (38.5%); the lowest (30.5%) and middle (31.0%) categories are nearly equal.

Meat: The distribution is fairly balanced (32.7%, 31.6%, 35.7%), with the highest category slightly ahead.

Potatoes: The distribution is fairly balanced (32.1%, 31.1%, 36.8%), with the highest category the most common.

Sweets: The middle category is the most common (39.7%), followed by the highest (33.9%) and the lowest (26.4%).

Vegetables: The highest category is the most common (44.8%), and the middle category is the smallest (24.1%).

Yogurt: Most participants are in the lowest category (59.9%), with shares falling to 23.7% in the middle and 16.4% in the highest.
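Statements about which category is most common are easier to verify from a proportion table than from a bar chart. A minimal sketch, using a made-up three-level factor (`consumption` and its labels are hypothetical):

```r
# Hypothetical three-level consumption factor (labels are illustrative).
consumption <- factor(
  c("low", "low", "mid", "mid", "mid", "high", "high", "high", "high", "high"),
  levels = c("low", "mid", "high")
)

counts      <- table(consumption)                  # frequency per level
percent     <- round(100 * prop.table(counts), 1)  # share of valid observations
modal_level <- names(counts)[which.max(counts)]    # most common level
```

Applying the same three calls to each factor column gives the percentages needed to describe the bar charts.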

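# Correlation among the numeric lifestyle variables.
# Only the Spearman (rank-based) matrix is kept: the Pearson matrix was
# overwritten before it was used, so it never reached the plot.
numeric_lifestyle <- dplyr::select(lifestyle_exposome, where(is.numeric))
cor_matrix <- cor(numeric_lifestyle, method = "spearman")
corrplot::corrplot(cor_matrix, method = "circle")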
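The difference between the two correlation methods shows up whenever a relationship is monotone but not linear. A small sketch with made-up vectors `x` and `y`:

```r
x <- 1:10
y <- exp(x)   # strictly increasing but strongly nonlinear

r_pearson  <- cor(x, y, method = "pearson")   # measures linear association
r_spearman <- cor(x, y, method = "spearman")  # measures monotone (rank) association
```

Because `y` increases with every step of `x`, the rank correlation is perfect while the Pearson coefficient is noticeably lower. This is one reason a rank-based coefficient can be a reasonable choice for skewed variables.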

Data Summary Exposures: Chemicals

Chemical_Exposures <- combined_codebook$variable_name[combined_codebook$domain=="Chemicals"]
chemical_exposome <- exposome %>%
  dplyr::select(all_of(Chemical_Exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 hs_as_c_Log2 [numeric]
Mean (sd) : -1 (3.3)
min ≤ med ≤ max:
-15 ≤ 0.5 ≤ 4.8
IQR (CV) : 5.3 (-3.3)
692 distinct values 0 (0.0%)
2 hs_cd_c_Log2 [numeric]
Mean (sd) : -4 (1)
min ≤ med ≤ max:
-10.4 ≤ -3.8 ≤ 0.8
IQR (CV) : 1 (-0.3)
695 distinct values 0 (0.0%)
3 hs_co_c_Log2 [numeric]
Mean (sd) : -2.3 (0.6)
min ≤ med ≤ max:
-5.5 ≤ -2.4 ≤ 1.4
IQR (CV) : 0.7 (-0.3)
317 distinct values 0 (0.0%)
4 hs_cs_c_Log2 [numeric]
Mean (sd) : 0.4 (0.6)
min ≤ med ≤ max:
-1.5 ≤ 0.5 ≤ 3.1
IQR (CV) : 0.8 (1.3)
369 distinct values 0 (0.0%)
5 hs_cu_c_Log2 [numeric]
Mean (sd) : 9.8 (0.2)
min ≤ med ≤ max:
9.1 ≤ 9.8 ≤ 12.1
IQR (CV) : 0.3 (0)
345 distinct values 0 (0.0%)
6 hs_hg_c_Log2 [numeric]
Mean (sd) : -0.3 (1.7)
min ≤ med ≤ max:
-10.9 ≤ -0.2 ≤ 3.7
IQR (CV) : 2.1 (-5.6)
698 distinct values 0 (0.0%)
7 hs_mn_c_Log2 [numeric]
Mean (sd) : 3.1 (0.4)
min ≤ med ≤ max:
1.7 ≤ 3.1 ≤ 4.8
IQR (CV) : 0.6 (0.1)
457 distinct values 0 (0.0%)
8 hs_mo_c_Log2 [numeric]
Mean (sd) : -0.3 (0.9)
min ≤ med ≤ max:
-9.2 ≤ -0.4 ≤ 5.1
IQR (CV) : 0.8 (-2.9)
593 distinct values 0 (0.0%)
9 hs_pb_c_Log2 [numeric]
Mean (sd) : 3.1 (0.6)
min ≤ med ≤ max:
1.1 ≤ 3.1 ≤ 7.7
IQR (CV) : 0.8 (0.2)
529 distinct values 0 (0.0%)
10 hs_tl_cdich_None [factor]
1. Detected
2. Undetected
102(7.8%)
1199(92.2%)
0 (0.0%)
11 hs_dde_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1.5)
min ≤ med ≤ max:
1.2 ≤ 4.5 ≤ 11.1
IQR (CV) : 1.9 (0.3)
1050 distinct values 0 (0.0%)
12 hs_ddt_cadj_Log2 [numeric]
Mean (sd) : -1.6 (3.7)
min ≤ med ≤ max:
-15.4 ≤ -0.5 ≤ 7.6
IQR (CV) : 2.5 (-2.3)
1039 distinct values 0 (0.0%)
13 hs_hcb_cadj_Log2 [numeric]
Mean (sd) : 3.2 (0.9)
min ≤ med ≤ max:
-13.1 ≤ 3.1 ≤ 6.5
IQR (CV) : 0.9 (0.3)
1036 distinct values 0 (0.0%)
14 hs_pcb118_cadj_Log2 [numeric]
Mean (sd) : 1.1 (0.8)
min ≤ med ≤ max:
-7 ≤ 1 ≤ 4.8
IQR (CV) : 1 (0.7)
1048 distinct values 0 (0.0%)
15 hs_pcb138_cadj_Log2 [numeric]
Mean (sd) : 2.4 (1.1)
min ≤ med ≤ max:
-9.4 ≤ 2.4 ≤ 7.7
IQR (CV) : 1.4 (0.5)
1031 distinct values 0 (0.0%)
16 hs_pcb153_cadj_Log2 [numeric]
Mean (sd) : 3.6 (0.9)
min ≤ med ≤ max:
1.2 ≤ 3.5 ≤ 7.8
IQR (CV) : 1.4 (0.3)
1047 distinct values 0 (0.0%)
17 hs_pcb170_cadj_Log2 [numeric]
Mean (sd) : -0.3 (3)
min ≤ med ≤ max:
-16.8 ≤ 0.3 ≤ 4.8
IQR (CV) : 2.2 (-9.8)
1039 distinct values 0 (0.0%)
18 hs_pcb180_cadj_Log2 [numeric]
Mean (sd) : 1.7 (1.9)
min ≤ med ≤ max:
-11.7 ≤ 1.8 ≤ 5.9
IQR (CV) : 2.3 (1.1)
1055 distinct values 0 (0.0%)
19 hs_sumPCBs5_cadj_Log2 [numeric]
Mean (sd) : 4.6 (1)
min ≤ med ≤ max:
2.2 ≤ 4.6 ≤ 9.3
IQR (CV) : 1.5 (0.2)
1052 distinct values 0 (0.0%)
20 hs_dep_cadj_Log2 [numeric]
Mean (sd) : 0.2 (3.2)
min ≤ med ≤ max:
-12.6 ≤ 0.9 ≤ 9.4
IQR (CV) : 3.3 (20)
1045 distinct values 0 (0.0%)
21 hs_detp_cadj_Log2 [numeric]
Mean (sd) : -2.4 (3.6)
min ≤ med ≤ max:
-15.4 ≤ -3.3 ≤ 6.3
IQR (CV) : 6 (-1.5)
1036 distinct values 0 (0.0%)
22 hs_dmdtp_cdich_None [factor]
1. Detected
2. Undetected
227(17.4%)
1074(82.6%)
0 (0.0%)
23 hs_dmp_cadj_Log2 [numeric]
Mean (sd) : -1.4 (4)
min ≤ med ≤ max:
-16.6 ≤ -0.3 ≤ 6.4
IQR (CV) : 7 (-2.9)
1053 distinct values 0 (0.0%)
24 hs_dmtp_cadj_Log2 [numeric]
Mean (sd) : 1.1 (2.6)
min ≤ med ≤ max:
-10.6 ≤ 1.6 ≤ 8.7
IQR (CV) : 2.4 (2.3)
1057 distinct values 0 (0.0%)
25 hs_pbde153_cadj_Log2 [numeric]
Mean (sd) : -4.5 (3.8)
min ≤ med ≤ max:
-17.6 ≤ -2.6 ≤ 4
IQR (CV) : 6.7 (-0.8)
1036 distinct values 0 (0.0%)
26 hs_pbde47_cadj_Log2 [numeric]
Mean (sd) : -2.6 (2.5)
min ≤ med ≤ max:
-15.4 ≤ -2.1 ≤ 5.4
IQR (CV) : 1.2 (-1)
1010 distinct values 0 (0.0%)
27 hs_pfhxs_c_Log2 [numeric]
Mean (sd) : -1.6 (1.3)
min ≤ med ≤ max:
-8.9 ≤ -1.4 ≤ 4.8
IQR (CV) : 1.7 (-0.8)
1061 distinct values 0 (0.0%)
28 hs_pfna_c_Log2 [numeric]
Mean (sd) : -1.1 (1.1)
min ≤ med ≤ max:
-8.1 ≤ -1.1 ≤ 2.7
IQR (CV) : 1.3 (-1)
1031 distinct values 0 (0.0%)
29 hs_pfoa_c_Log2 [numeric]
Mean (sd) : 0.6 (0.6)
min ≤ med ≤ max:
-2.2 ≤ 0.6 ≤ 2.7
IQR (CV) : 0.7 (0.9)
1061 distinct values 0 (0.0%)
30 hs_pfos_c_Log2 [numeric]
Mean (sd) : 1 (1.1)
min ≤ med ≤ max:
-10.4 ≤ 1 ≤ 5.1
IQR (CV) : 1.3 (1.1)
1050 distinct values 0 (0.0%)
31 hs_pfunda_c_Log2 [numeric]
Mean (sd) : -4.2 (1.6)
min ≤ med ≤ max:
-11.8 ≤ -4.1 ≤ 0.6
IQR (CV) : 1.7 (-0.4)
1044 distinct values 0 (0.0%)
32 hs_bpa_cadj_Log2 [numeric]
Mean (sd) : 2.1 (1.5)
min ≤ med ≤ max:
-7.2 ≤ 2 ≤ 7.8
IQR (CV) : 1.6 (0.7)
1056 distinct values 0 (0.0%)
33 hs_bupa_cadj_Log2 [numeric]
Mean (sd) : -3.5 (2)
min ≤ med ≤ max:
-13.9 ≤ -3.5 ≤ 6.6
IQR (CV) : 1.8 (-0.6)
1034 distinct values 0 (0.0%)
34 hs_etpa_cadj_Log2 [numeric]
Mean (sd) : -0.1 (1.9)
min ≤ med ≤ max:
-6.1 ≤ -0.6 ≤ 11
IQR (CV) : 1.6 (-14.3)
1066 distinct values 0 (0.0%)
35 hs_mepa_cadj_Log2 [numeric]
Mean (sd) : 3.4 (2.5)
min ≤ med ≤ max:
-6.9 ≤ 2.7 ≤ 14.5
IQR (CV) : 3 (0.7)
1052 distinct values 0 (0.0%)
36 hs_oxbe_cadj_Log2 [numeric]
Mean (sd) : 1.5 (2.4)
min ≤ med ≤ max:
-4.1 ≤ 1.1 ≤ 13
IQR (CV) : 3 (1.6)
1069 distinct values 0 (0.0%)
37 hs_prpa_cadj_Log2 [numeric]
Mean (sd) : -1.6 (3.8)
min ≤ med ≤ max:
-12 ≤ -2.3 ≤ 10.8
IQR (CV) : 5.2 (-2.4)
1031 distinct values 0 (0.0%)
38 hs_trcs_cadj_Log2 [numeric]
Mean (sd) : -0.4 (2)
min ≤ med ≤ max:
-4.4 ≤ -0.7 ≤ 9.3
IQR (CV) : 2.2 (-5.6)
1053 distinct values 0 (0.0%)
39 hs_mbzp_cadj_Log2 [numeric]
Mean (sd) : 2.4 (1.2)
min ≤ med ≤ max:
-0.6 ≤ 2.3 ≤ 7.2
IQR (CV) : 1.5 (0.5)
1046 distinct values 0 (0.0%)
40 hs_mecpp_cadj_Log2 [numeric]
Mean (sd) : 5.2 (1.1)
min ≤ med ≤ max:
2.6 ≤ 5.1 ≤ 10.6
IQR (CV) : 1.5 (0.2)
1037 distinct values 0 (0.0%)
41 hs_mehhp_cadj_Log2 [numeric]
Mean (sd) : 4.4 (1.1)
min ≤ med ≤ max:
1.8 ≤ 4.4 ≤ 11.1
IQR (CV) : 1.4 (0.2)
1050 distinct values 0 (0.0%)
42 hs_mehp_cadj_Log2 [numeric]
Mean (sd) : 1.6 (1.2)
min ≤ med ≤ max:
-1.6 ≤ 1.6 ≤ 8.1
IQR (CV) : 1.5 (0.7)
1035 distinct values 0 (0.0%)
43 hs_meohp_cadj_Log2 [numeric]
Mean (sd) : 3.7 (1.1)
min ≤ med ≤ max:
1.1 ≤ 3.6 ≤ 10.3
IQR (CV) : 1.5 (0.3)
1057 distinct values 0 (0.0%)
44 hs_mep_cadj_Log2 [numeric]
Mean (sd) : 5.3 (1.6)
min ≤ med ≤ max:
1.7 ≤ 5.1 ≤ 11.6
IQR (CV) : 2.2 (0.3)
1075 distinct values 0 (0.0%)
45 hs_mibp_cadj_Log2 [numeric]
Mean (sd) : 5.5 (1.1)
min ≤ med ≤ max:
2.3 ≤ 5.4 ≤ 9.8
IQR (CV) : 1.5 (0.2)
1057 distinct values 0 (0.0%)
46 hs_mnbp_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1)
min ≤ med ≤ max:
1.9 ≤ 4.6 ≤ 8.9
IQR (CV) : 1.3 (0.2)
1048 distinct values 0 (0.0%)
47 hs_ohminp_cadj_Log2 [numeric]
Mean (sd) : 2.6 (1.2)
min ≤ med ≤ max:
-0.3 ≤ 2.4 ≤ 9.1
IQR (CV) : 1.5 (0.5)
1085 distinct values 0 (0.0%)
48 hs_oxominp_cadj_Log2 [numeric]
Mean (sd) : 1.7 (1.2)
min ≤ med ≤ max:
-0.9 ≤ 1.5 ≤ 9.4
IQR (CV) : 1.4 (0.7)
1059 distinct values 0 (0.0%)
49 hs_sumDEHP_cadj_Log2 [numeric]
Mean (sd) : 6 (1.2)
min ≤ med ≤ max:
2.6 ≤ 6 ≤ 10.1
IQR (CV) : 1.6 (0.2)
1028 distinct values 0 (0.0%)
50 FAS_cat_None [factor]
1. Low
2. Middle
3. High
146(11.2%)
486(37.4%)
669(51.4%)
0 (0.0%)
51 hs_contactfam_3cat_num_None [factor]
1. (almost) Daily
2. Once a week
3. Less than once a week
863(66.3%)
382(29.4%)
56(4.3%)
0 (0.0%)
52 hs_hm_pers_None [numeric]
Mean (sd) : 4.2 (1)
min ≤ med ≤ max:
1 ≤ 4 ≤ 10
IQR (CV) : 1 (0.2)
1:2(0.2%)
2:36(2.8%)
3:180(13.8%)
4:670(51.5%)
5:297(22.8%)
6:85(6.5%)
7:17(1.3%)
8:8(0.6%)
9:5(0.4%)
10:1(0.1%)
0 (0.0%)
53 hs_participation_3cat_None [factor]
1. None
2. 1 organisation
3. 2 or more organisations
748(57.5%)
355(27.3%)
198(15.2%)
0 (0.0%)
54 hs_cotinine_cdich_None [factor]
1. Detected
2. Undetected
223(17.1%)
1078(82.9%)
0 (0.0%)
55 hs_globalexp2_None [factor]
1. exposure
2. no exposure
463(35.6%)
838(64.4%)
0 (0.0%)
56 hs_smk_parents_None [factor]
1. both
2. neither
3. one
142(10.9%)
814(62.6%)
345(26.5%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-06-24

#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>% 
  dplyr::select(where(is.numeric))

numeric_chemical_long <- pivot_longer(
  numeric_chemical,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_chemical_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_chemical_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Arsenic (hs_as_c_Log2): This histogram shows a bimodal distribution of arsenic levels, with two prominent peaks. Such a distribution might suggest two different populations or sources of exposure among the study participants.

Cadmium (hs_cd_c_Log2): The distribution of cadmium levels is skewed to the right, indicating that most participants have lower exposure levels, with a few cases showing significantly higher exposures.

Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution with a slight positive skew. This suggests a common source of exposure with varying levels among the population.

Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution, indicating that most participants have relatively low exposure levels, but a small number have substantially higher exposures.

Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.

Mercury (hs_hg_c_Log2): On the Log2 scale the summary statistics (mean -0.3 below the median -0.2, minimum -10.9, maximum 3.7) point to a long left tail rather than a right skew: most values cluster near the median, and a minority of samples have very low measured levels.

Manganese (hs_mn_c_Log2): The histogram for manganese displays a bell-shaped distribution, indicating a normal distribution of manganese levels among the participants.

Molybdenum (hs_mo_c_Log2): Shows a distribution with a sharp peak and a long right tail, suggesting that while most people have similar exposure levels, a few have exceptionally high exposures.

Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.

DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher levels of exposure in a smaller subset of the population.

DDT (hs_ddt_cadj_Log2): This histogram displays a multi-modal distribution, suggesting different sources or durations of exposure among the population.

Hexachlorobenzene (hs_hcb_cadj_Log2): Exhibits a right-skewed distribution with a long tail, indicating that most people have lower exposure levels with some outliers experiencing very high exposures.

PCB 118, 138, 153 (hs_pcb118_cadj_Log2, hs_pcb138_cadj_Log2, hs_pcb153_cadj_Log2): All three PCBs show similar distributions with right skewness, suggesting that exposure to these compounds is higher among a smaller segment of the population.

PCB 170 and PCB 180 (hs_pcb170_cadj_Log2, hs_pcb180_cadj_Log2): On the Log2 scale both variables have a long left tail (minima of -16.8 and -11.7 against medians of 0.3 and 1.8, with means below the medians). Most samples cluster at moderate concentrations, while a minority have very low values.

Sum of PCBs: The histogram is approximately normally distributed, centered around a higher value compared to individual PCBs, indicating a collective higher average exposure when all measured PCBs are considered together.

DEP, DETP, DMP, DMTP, PBDE 153, and PBDE 47: These histograms mostly show multimodal distributions (more than one peak), suggesting different exposure sources or groups within the population that have distinct exposure levels. The multiple peaks could indicate varied exposure pathways or differences in how these chemicals are metabolized or retained in the body. (DMDTP is dichotomized as detected/undetected, so it appears with the categorical variables rather than here.)

PFHxS, PFNA, and PFOA: These perfluorinated compounds display a roughly normal distribution skewed right, suggesting a common source of exposure among the population, but with some individuals experiencing higher exposures.

PFOS and PFUnDA: The histograms show a single, sharp peak with a rapid decline, indicating that most individuals have similar exposure levels, likely due to common environmental sources or regulatory controls limiting variability.

BPA (hs_bpa_cadj_Log2): The histogram is sharply peaked with a long tail to the right, indicating low exposure for most individuals but substantially higher exposure for a few, possibly due to specific product use or dietary sources (occupational exposure is implausible in children aged 6-11).

MBZP (Monobenzyl Phthalate): This histogram shows a right-skewed distribution. Most values cluster at the lower end, indicating a common lower exposure level among subjects, with a long tail towards higher values suggesting occasional higher exposures.

MECPP (Mono-2-ethyl-5-carboxypentyl phthalate): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.

MEHHP (Mono-2-ethyl-5-hydroxyhexyl phthalate): Exhibits a unimodal distribution with a peak around a middle value and symmetric tails. This could indicate a more standardized exposure level among the subjects with some variation.

MEHP (Mono-2-ethylhexyl phthalate): Another right-skewed distribution, indicating that most subjects have lower exposure levels but a few have much higher levels.

MEOHP (Mono-2-ethyl-5-oxohexyl phthalate): This histogram shows a distribution with a peak around the middle values and a tail extending towards higher values, suggesting a central tendency with some higher exposures.

MEP (Mono-ethyl phthalate): The distribution is right-skewed, similar to others, showing most subjects with low to moderate levels of exposure, but a few have much higher levels.

oxo-MiNP (hs_oxominp_cadj_Log2, mono-4-methyl-7-oxooctyl phthalate): This histogram shows a central peak with a fast decline, indicating a concentration of values around a specific point, which might suggest a common exposure level among the subjects.

Sum of DEHP Metabolites: This shows a broad distribution with a peak towards the lower end, indicating varied exposure levels among the subjects, with most experiencing lower exposures.

Household size (hs_hm_pers_None): This variable takes integer values from 1 to 10 and appears to record the number of people living in the home rather than product use. The histogram is concentrated at 4 (670 children, 51.5%), with a short right tail of larger households.
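The descriptions above rely on visual inspection of the histograms. As a hedged complement, the sketch below computes a sample skewness coefficient in base R so that "right-skewed" or "left-skewed" can be checked numerically. The helper `sample_skewness` and the simulated `toy` data are assumptions introduced for illustration; they are not part of the original analysis.

```r
# Sample skewness (third standardised moment), base R only.
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Simulated stand-ins for two exposure columns (NOT the study data).
set.seed(1)
toy <- data.frame(
  right_tailed = rexp(500),   # long tail towards high values
  left_tailed  = -rexp(500)   # long tail towards low values
)

skew <- vapply(toy, sample_skewness, numeric(1))
print(round(skew, 2))
```

Applied to the report's data, the same helper could be run as `vapply(numeric_chemical, sample_skewness, numeric(1))`. Because the variables are on a Log2 scale, a few near-zero raw concentrations can produce strongly negative skewness.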

categorical_chemical <- chemical_exposome %>% 
  dplyr::select(where(is.factor))

categorical_chemical_long <- pivot_longer(
  categorical_chemical,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_chemical_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_chemical_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

hs_tl_cdich_None (Thallium, Detected vs. Undetected): The vast majority of samples were undetected (1199, 92.2%), with only a small fraction (102, 7.8%) showing detection.

hs_dmdtp_cdich_None (Detected vs. Undetected): Similar to the previous, most samples were undetected, but a higher proportion shows detection compared to the first chemical.

FAS_cat_None (Family Affluence Scale categories - Low, Middle, High): This shows the distribution of family affluence categories where the largest group is the high affluence, followed by middle, with the fewest in the low category.

hs_contactfam_3cat_num_None (Frequency of contact with family): Most individuals reported daily (almost daily) contact with family, a smaller number reported weekly contact, and the fewest reported less frequent than weekly contact.

hs_participation_3cat_None (Participation in organisations): A large number of individuals do not participate in any organisation, a substantial number participate in one, and a smaller group in two or more.

hs_cotinine_cdich_None (Detected vs. Undetected): Cotinine, a marker of nicotine exposure, was detected in a minority of samples: 223 (17.1%) detected versus 1078 (82.9%) undetected.

hs_globalexp2_None (Global Exposure - Exposure vs. No Exposure): This variable sits among the tobacco-smoke indicators and appears to summarize the child's overall exposure to environmental tobacco smoke. A larger proportion has no exposure (838, 64.4%) than has exposure (463, 35.6%).

hs_smk_parents_None (Smoking status of parents - Both, Neither, One): The largest group reported that neither parent smokes, a significant number reported one smoking parent, and the smallest group reported both parents smoke.
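Statements such as "most samples were undetected" are easier to verify when the percentages are tabulated next to them. The sketch below rebuilds the detection percentages from the counts reported in the dfSummary output for the three dichotomized chemicals; the object names `detection_counts` and `detection_pct` are illustrative.

```r
# Counts copied from the dfSummary output for the dichotomised variables.
detection_counts <- rbind(
  hs_tl_cdich_None       = c(Detected = 102, Undetected = 1199),
  hs_dmdtp_cdich_None    = c(Detected = 227, Undetected = 1074),
  hs_cotinine_cdich_None = c(Detected = 223, Undetected = 1078)
)

# Row-wise percentages (each row sums to 100).
detection_pct <- round(100 * prop.table(detection_counts, margin = 1), 1)
print(detection_pct)
```

On the real data frame, `prop.table(table(chemical_exposome$hs_cotinine_cdich_None))` would give the same proportions for a single variable.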

numeric_chemical <- select_if(chemical_exposome, is.numeric)
# Spearman rank correlation; a Pearson matrix assigned to the same name would be overwritten here
cor_matrix <- cor(numeric_chemical, method = "spearman")
custom_color_scale <- list(
  c(0, "darkred"),    
  c(0.5, "white"), 
  c(1, "darkblue")
)

plot_ly(
  z = cor_matrix, 
  x = colnames(cor_matrix), 
  y = colnames(cor_matrix), 
  type = "heatmap",
  colorscale = custom_color_scale
) %>%
layout(
  title = "Correlation Matrix",
  xaxis = list(tickangle = -90),
  yaxis = list(side = "left")
)
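A heatmap of roughly fifty variables is hard to read pair by pair. One possible follow-up, sketched here on simulated data, is to list the most strongly correlated pairs from the Spearman matrix. The `toy` data frame and the `pairs` table are hypothetical stand-ins for `numeric_chemical` and are not taken from the study.

```r
# Simulated numeric data standing in for numeric_chemical (NOT the study data).
set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n), d = rnorm(n))

cor_matrix <- cor(toy, method = "spearman")

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  rho  = cor_matrix[idx]
)
pairs <- pairs[order(-abs(pairs$rho)), ]
print(head(pairs, 3))
```

Highly correlated exposures matter for the LASSO models later in the report, because the penalty tends to keep one variable from a correlated group and drop the rest.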

Data Summary Covariates

summarytools::view(dfSummary(covariates, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|----|----------|----------------|--------------------|---------|
| 1 | ID [integer] | Mean (sd): 651 (375.7); min ≤ med ≤ max: 1 ≤ 651 ≤ 1301; IQR (CV): 650 (0.6) | 1301 distinct values (integer sequence) | 0 (0.0%) |
| 2 | h_cohort [factor] | 1. 1; 2. 2; 3. 3; 4. 4; 5. 5; 6. 6 | 202 (15.5%); 198 (15.2%); 224 (17.2%); 207 (15.9%); 272 (20.9%); 198 (15.2%) | 0 (0.0%) |
| 3 | e3_sex_None [factor] | 1. female; 2. male | 608 (46.7%); 693 (53.3%) | 0 (0.0%) |
| 4 | e3_yearbir_None [factor] | 1. 2003; 2. 2004; 3. 2005; 4. 2006; 5. 2007; 6. 2008; 7. 2009 | 55 (4.2%); 107 (8.2%); 241 (18.5%); 256 (19.7%); 250 (19.2%); 379 (29.1%); 13 (1.0%) | 0 (0.0%) |
| 5 | h_mbmi_None [numeric] | Mean (sd): 25 (5.2); min ≤ med ≤ max: 15.9 ≤ 24 ≤ 51.4; IQR (CV): 6.1 (0.2) | 853 distinct values | 0 (0.0%) |
| 6 | hs_wgtgain_None [numeric] | Mean (sd): 13.5 (6.2); min ≤ med ≤ max: 0 ≤ 12 ≤ 55; IQR (CV): 9 (0.5) | 49 distinct values | 0 (0.0%) |
| 7 | e3_gac_None [numeric] | Mean (sd): 39.6 (1.7); min ≤ med ≤ max: 28 ≤ 40 ≤ 44.1; IQR (CV): 2 (0) | 72 distinct values | 0 (0.0%) |
| 8 | h_age_None [numeric] | Mean (sd): 30.8 (4.9); min ≤ med ≤ max: 16 ≤ 31 ≤ 43.5; IQR (CV): 6.4 (0.2) | 665 distinct values | 0 (0.0%) |
| 9 | h_edumc_None [factor] | 1. 1; 2. 2; 3. 3 | 178 (13.7%); 449 (34.5%); 674 (51.8%) | 0 (0.0%) |
| 10 | h_native_None [factor] | 1. 0; 2. 1; 3. 2 | 146 (11.2%); 67 (5.1%); 1088 (83.6%) | 0 (0.0%) |
| 11 | h_parity_None [factor] | 1. 0; 2. 1; 3. 2 | 601 (46.2%); 464 (35.7%); 236 (18.1%) | 0 (0.0%) |
| 12 | hs_child_age_None [numeric] | Mean (sd): 8 (1.6); min ≤ med ≤ max: 5.4 ≤ 8 ≤ 12.1; IQR (CV): 2.4 (0.2) | 879 distinct values | 0 (0.0%) |
| 13 | hs_c_height_None [numeric] | Mean (sd): 1.3 (0.1); min ≤ med ≤ max: 1.1 ≤ 1.3 ≤ 1.7; IQR (CV): 0.2 (0.1) | 311 distinct values | 0 (0.0%) |
| 14 | hs_c_weight_None [numeric] | Mean (sd): 28.5 (7.7); min ≤ med ≤ max: 16 ≤ 26.9 ≤ 71.1; IQR (CV): 9.8 (0.3) | 311 distinct values | 0 (0.0%) |


#separate numeric and categorical data
numeric_covariates <- covariates %>% 
  dplyr::select(where(is.numeric))

numeric_covariates_long <- pivot_longer(
  numeric_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_covariates_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

ID: This histogram appears to show a uniform distribution of IDs over a range, with all IDs evenly spaced. This typical pattern is expected in a dataset where IDs are systematically assigned.

Maternal BMI (h_mbmi): The distribution of maternal BMI is roughly normal but slightly right-skewed, with a tail extending towards higher BMI values (median 24, mean 25, maximum 51.4). Most values concentrate in the 20s, which is typical for adult populations.

Weight Gain (hs_wgtgain): This histogram displays a bimodal distribution of maternal weight gain during pregnancy, with peaks around 10 and around 20. This could indicate two common patterns of gestational weight gain, or rounding of reported values (only 49 distinct values are recorded).

Gestational Age at Childbirth (e3_gac): The distribution is centered around the 40-week mark, which is typical for full-term pregnancies. There is a sharp peak at around 40 weeks, showing that most childbirths occur at this gestational age.

Maternal Age (h_age): This histogram shows a roughly normal distribution with a peak around the early 30s, suggesting that this is the most common age range for the mothers in the dataset.

Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.

Child’s Height (hs_c_height): The data is approximately normally distributed with a slight right skew. The majority of the measurements cluster around the mean, which suggests typical growth patterns.

Child’s Weight (hs_c_weight): This histogram is right-skewed, indicating that while most children’s weights are within a normal range, there is a long tail of children who weigh more, which might suggest variations in growth or cases of overweight.

categorical_covariates <- covariates %>% 
  dplyr::select(where(is.factor))

categorical_covariates_long <- pivot_longer(
  categorical_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_covariates_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Cohorts (h_cohort): The distribution shows the count of subjects across six different cohorts. All cohorts have a substantial number of subjects, with cohort 5 showing the highest participation.

Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count of males (693, 53.3%) than females (608, 46.7%).

Year of Birth (e3_yearbir): This chart shows that the majority of subjects were born in the later years, with 2008 the largest group (379, 29.1%), indicating perhaps a larger recruitment or a specific cohort focus that year. Only 13 subjects (1.0%) were born in 2009.

Educational Level (h_edumc): Represents three categories of educational attainment, with category 3 having the highest count (674, 51.8%), suggesting a higher level of education among the majority of the subjects.

Native Status (h_native): Shows the count of subjects by the native-country status of their parents. The majority are in category 2 (1088, 83.6%).

Parity (h_parity): The chart groups mothers by parity, that is, by the number of births before the index child (categories 0, 1, and 2). The largest group is category 0 (601, 46.2%), followed by category 1 (464, 35.7%) and a smaller group in category 2 (236, 18.1%).

numeric_covariate <- select_if(covariates, is.numeric)
# Spearman rank correlation; a Pearson matrix assigned to the same name would be overwritten here
cor_matrix <- cor(numeric_covariate, method = "spearman")
corrplot(cor_matrix, method = "circle")

Data Summary Outcome: Phenotype

outcome_BMI <- phenotype %>% 
  dplyr::select(hs_zbmi_who, hs_bmi_c_cat)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|----|----------|----------------|--------------------|---------|
| 1 | hs_zbmi_who [numeric] | Mean (sd): 0.4 (1.2); min ≤ med ≤ max: -3.6 ≤ 0.3 ≤ 4.7; IQR (CV): 1.5 (3) | 421 distinct values | 0 (0.0%) |
| 2 | hs_bmi_c_cat [factor] | 1. 1; 2. 2; 3. 3; 4. 4 | 13 (1.0%); 904 (69.5%); 253 (19.4%); 131 (10.1%) | 0 (0.0%) |


Models to Consider

outcome_cov <- cbind(covariates, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
outcome_cov <- outcome_cov %>%
  dplyr::select(hs_child_age_None, h_cohort, e3_sex_None, e3_yearbir_None, h_edumc_None, h_native_None, hs_zbmi_who)
summary_table <- dfSummary(outcome_cov, 
                           varnumbers = TRUE, 
                           valid.col = FALSE, 
                           graph.col = TRUE, 
                           style = "multiline")

print(summary_table, method = "render", plain.ascii = FALSE, style = "grid")

Data Frame Summary

outcome_cov

Dimensions: 1301 x 7
Duplicates: 1
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|----|----------|----------------|--------------------|---------|
| 1 | hs_child_age_None [numeric] | Mean (sd): 8 (1.6); min ≤ med ≤ max: 5.4 ≤ 8 ≤ 12.1; IQR (CV): 2.4 (0.2) | 879 distinct values | 0 (0.0%) |
| 2 | h_cohort [factor] | 1. 1; 2. 2; 3. 3; 4. 4; 5. 5; 6. 6 | 202 (15.5%); 198 (15.2%); 224 (17.2%); 207 (15.9%); 272 (20.9%); 198 (15.2%) | 0 (0.0%) |
| 3 | e3_sex_None [factor] | 1. female; 2. male | 608 (46.7%); 693 (53.3%) | 0 (0.0%) |
| 4 | e3_yearbir_None [factor] | 1. 2003; 2. 2004; 3. 2005; 4. 2006; 5. 2007; 6. 2008; 7. 2009 | 55 (4.2%); 107 (8.2%); 241 (18.5%); 256 (19.7%); 250 (19.2%); 379 (29.1%); 13 (1.0%) | 0 (0.0%) |
| 5 | h_edumc_None [factor] | 1. 1; 2. 2; 3. 3 | 178 (13.7%); 449 (34.5%); 674 (51.8%) | 0 (0.0%) |
| 6 | h_native_None [factor] | 1. 0; 2. 1; 3. 2 | 146 (11.2%); 67 (5.1%); 1088 (83.6%) | 0 (0.0%) |
| 7 | hs_zbmi_who [numeric] | Mean (sd): 0.4 (1.2); min ≤ med ≤ max: -3.6 ≤ 0.3 ≤ 4.7; IQR (CV): 1.5 (3) | 421 distinct values | 0 (0.0%) |


#the full chemicals list
chemicals_full <- c(
  "hs_as_c_Log2",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mn_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_tl_cdich_None",
  "hs_dde_cadj_Log2",
  "hs_ddt_cadj_Log2",
  "hs_hcb_cadj_Log2",
  "hs_pcb118_cadj_Log2",
  "hs_pcb138_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_pcb180_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_dmdtp_cdich_None",
  "hs_dmp_cadj_Log2",
  "hs_dmtp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pbde47_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfna_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_pfunda_c_Log2",
  "hs_bpa_cadj_Log2",
  "hs_bupa_cadj_Log2",
  "hs_etpa_cadj_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_trcs_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mecpp_cadj_Log2",
  "hs_mehhp_cadj_Log2",
  "hs_mehp_cadj_Log2",
  "hs_meohp_cadj_Log2",
  "hs_mep_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2",
  "hs_ohminp_cadj_Log2",
  "hs_oxominp_cadj_Log2",
  "FAS_cat_None",
  "hs_contactfam_3cat_num_None",
  "hs_hm_pers_None",
  "hs_participation_3cat_None",
  "hs_cotinine_cdich_None",
  "hs_globalexp2_None",
  "hs_smk_parents_None"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_beverages_Ter",
  "hs_break_cer_Ter",
  "hs_caff_drink_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "h_legume_preg_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_cereal_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_meat_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_total_yog_Ter"
)

all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
head(extracted_exposome)

Final Selected Data

selected_data <- cbind(outcome_cov, extracted_exposome)
head(selected_data)
selected_data_corr <- select_if(selected_data, is.numeric)
# Spearman rank correlation; a Pearson matrix assigned to the same name would be overwritten here
cor_matrix <- cor(selected_data_corr, method = "spearman")
custom_color_scale <- list(
  c(0, "darkred"),    
  c(0.5, "white"), 
  c(1, "darkblue")
)

plot_ly(
  z = cor_matrix, 
  x = colnames(cor_matrix), 
  y = colnames(cor_matrix), 
  type = "heatmap",
  colorscale = custom_color_scale
) %>%
layout(
  title = "Correlation Matrix",
  xaxis = list(tickangle = -90),
  yaxis = list(side = "left")
)

Comparing Models with and without Covariates

Chemicals Data

chemical_data_only <- selected_data[, chemicals_full]
covariate_names <- c("e3_sex_None", "e3_yearbir_None", "h_edumc_None", "h_cohort", "hs_child_age_None")
covariates_data <- selected_data[, covariate_names]

# selected and outcome variable
x <- as.matrix(selected_data[, setdiff(names(selected_data), "hs_zbmi_who")])
y <- selected_data$hs_zbmi_who

# model with covariates
fit_with_covariates <- cv.glmnet(x, y, alpha = 1, family = "gaussian")
fit_with_covariates
## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.00756    42   1.172 0.07102      48
## 1se 0.08489    16   1.235 0.07126      12
# Model with covariates: chemical exposures plus the selected covariates.
# This reassigns fit_with_covariates, replacing the fit above on all predictors
# (which also included the diet variables).
full_data <- cbind(chemical_data_only, covariates_data)
x_full <- as.matrix(full_data)
fit_with_covariates <- cv.glmnet(x_full, y, alpha = 1, family = "gaussian")

# Model without covariates: chemical exposures only
x_chemicals_only <- as.matrix(chemical_data_only)
fit_without_covariates <- cv.glmnet(x_chemicals_only, y, alpha = 1, family = "gaussian")

plot(fit_with_covariates)

plot(fit_without_covariates)

cat("Model with Covariates - Lambda Min:", fit_with_covariates$lambda.min, "\n")
## Model with Covariates - Lambda Min: 0.0120332
cat("Model without Covariates - Lambda Min:", fit_without_covariates$lambda.min, "\n")
## Model without Covariates - Lambda Min: 0.01096421
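The `as.matrix()` calls above only produce a usable design matrix if every column is already numeric. If any predictor is still stored as a factor, one common alternative is `model.matrix()`, which expands factors into indicator columns. The sketch below illustrates the difference on a small hypothetical data frame; the values in `toy` are invented, and only the column names are borrowed from the report.

```r
# Toy data with one numeric and one factor predictor (hypothetical values).
toy <- data.frame(
  hs_child_age_None = c(6.5, 7.2, 8.1, 9.4, 10.0, 11.3),
  h_cohort          = factor(c("1", "2", "3", "1", "2", "3"))
)

# as.matrix() on mixed column types falls back to a character matrix,
# which is not a valid numeric design matrix.
x_char <- as.matrix(toy)

# model.matrix() expands the factor into 0/1 indicator columns. The intercept
# column is dropped because glmnet fits its own intercept by default.
x_num <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

print(typeof(x_char))
print(colnames(x_num))
```

With indicator coding, each non-reference cohort or category gets its own coefficient. The single `h_cohort` coefficient in the output below suggests the report instead treated the factor codes as one numeric column, which imposes an ordering on the categories.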

Predicting LASSO

#LASSO train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(selected_data)), size = floor(0.7 * nrow(selected_data)))
test_indices <- setdiff(seq_len(nrow(selected_data)), train_indices)

x_train <- as.matrix(selected_data[train_indices, setdiff(names(selected_data), "hs_zbmi_who")])
y_train <- selected_data$hs_zbmi_who[train_indices]

x_test <- as.matrix(selected_data[test_indices, setdiff(names(selected_data), "hs_zbmi_who")])
y_test <- selected_data$hs_zbmi_who[test_indices]

fit_with_covariates_train <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian")
fit_with_covariates_test <- predict(fit_with_covariates_train, s = "lambda.min", newx = x_test)
test_mse_with_covariates <- mean((y_test - fit_with_covariates_test)^2)

x_train_chemicals_only <- as.matrix(selected_data[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(selected_data[test_indices, chemicals_full])

fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 1, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_with_covariates_train, xvar = "lambda", main = "Coefficients Path (With Covariates)")

plot(fit_without_covariates_train, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
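The test-set errors computed above are stored but never printed, and an MSE alone is hard to interpret without a reference point. A minimal sketch of one way to report them follows, comparing the model's test MSE with a baseline that predicts the training mean for every child. All numbers are hypothetical and the helper `mse` is introduced only for illustration.

```r
# Hypothetical outcomes and predictions (NOT the study data).
y_train <- c(0.2, -0.5, 1.1, 0.4, 0.9)
y_test  <- c(0.1, 0.8, -0.3, 1.5)
y_pred  <- c(0.3, 0.6, 0.0, 1.0)

mse <- function(obs, pred) mean((obs - pred)^2)

test_mse     <- mse(y_test, y_pred)
baseline_mse <- mse(y_test, mean(y_train))    # predict the training mean for everyone
r2_test      <- 1 - test_mse / baseline_mse   # share of baseline error removed

cat("Test MSE:", test_mse, "\n")
cat("Baseline MSE:", baseline_mse, "\n")
cat("Test R^2:", r2_test, "\n")
```

In the report's own objects, the same comparison would use `test_mse_with_covariates`, `test_mse_without_covariates`, and `mean((y_test - mean(y_train))^2)` as the baseline.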

best_lambda <- fit_with_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates_train, s = best_lambda)  # coefficients at the chosen lambda
## 82 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                 -5.5678784940
## hs_child_age_None            .           
## h_cohort                     0.0695567858
## e3_sex_None                  .           
## e3_yearbir_None              .           
## h_edumc_None                 .           
## h_native_None                0.0511006743
## hs_as_c_Log2                 .           
## hs_cd_c_Log2                -0.0336774443
## hs_co_c_Log2                -0.0239636201
## hs_cs_c_Log2                 0.0820543441
## hs_cu_c_Log2                 0.6610426794
## hs_hg_c_Log2                -0.0017088529
## hs_mn_c_Log2                 .           
## hs_mo_c_Log2                -0.1156505312
## hs_pb_c_Log2                 .           
## hs_tl_cdich_None             .           
## hs_dde_cadj_Log2            -0.0657354802
## hs_ddt_cadj_Log2             .           
## hs_hcb_cadj_Log2             .           
## hs_pcb118_cadj_Log2          .           
## hs_pcb138_cadj_Log2          .           
## hs_pcb153_cadj_Log2         -0.1856578642
## hs_pcb170_cadj_Log2         -0.0573285421
## hs_pcb180_cadj_Log2          .           
## hs_dep_cadj_Log2            -0.0193032245
## hs_detp_cadj_Log2            .           
## hs_dmdtp_cdich_None          .           
## hs_dmp_cadj_Log2             .           
## hs_dmtp_cadj_Log2            .           
## hs_pbde153_cadj_Log2        -0.0328956540
## hs_pbde47_cadj_Log2          .           
## hs_pfhxs_c_Log2              .           
## hs_pfna_c_Log2               .           
## hs_pfoa_c_Log2              -0.0993929110
## hs_pfos_c_Log2              -0.0755069975
## hs_pfunda_c_Log2             .           
## hs_bpa_cadj_Log2             .           
## hs_bupa_cadj_Log2            .           
## hs_etpa_cadj_Log2            .           
## hs_mepa_cadj_Log2            .           
## hs_oxbe_cadj_Log2            0.0006650708
## hs_prpa_cadj_Log2            0.0057866608
## hs_trcs_cadj_Log2            0.0003819532
## hs_mbzp_cadj_Log2            0.0347966360
## hs_mecpp_cadj_Log2           .           
## hs_mehhp_cadj_Log2           .           
## hs_mehp_cadj_Log2            .           
## hs_meohp_cadj_Log2           .           
## hs_mep_cadj_Log2             .           
## hs_mibp_cadj_Log2           -0.0244119191
## hs_mnbp_cadj_Log2           -0.0243769631
## hs_ohminp_cadj_Log2          .           
## hs_oxominp_cadj_Log2         .           
## FAS_cat_None                 .           
## hs_contactfam_3cat_num_None  .           
## hs_hm_pers_None             -0.0028088257
## hs_participation_3cat_None   .           
## hs_cotinine_cdich_None       .           
## hs_globalexp2_None           .           
## hs_smk_parents_None          .           
## h_bfdur_Ter                  .           
## hs_bakery_prod_Ter           .           
## hs_beverages_Ter             .           
## hs_break_cer_Ter             .           
## hs_caff_drink_Ter            .           
## hs_dairy_Ter                 .           
## hs_fastfood_Ter              .           
## h_legume_preg_Ter            .           
## hs_org_food_Ter              .           
## hs_proc_meat_Ter             .           
## hs_readymade_Ter             .           
## hs_total_bread_Ter           .           
## hs_total_cereal_Ter          .           
## hs_total_fish_Ter            .           
## hs_total_fruits_Ter          .           
## hs_total_lipids_Ter          .           
## hs_total_meat_Ter            .           
## hs_total_potatoes_Ter        .           
## hs_total_sweets_Ter          .           
## hs_total_veg_Ter             .           
## hs_total_yog_Ter             .
best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 55 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                 -5.1837493309
## hs_as_c_Log2                 .           
## hs_cd_c_Log2                -0.0284173280
## hs_co_c_Log2                -0.0142399006
## hs_cs_c_Log2                 0.1021877004
## hs_cu_c_Log2                 0.6609749636
## hs_hg_c_Log2                -0.0171114022
## hs_mn_c_Log2                 .           
## hs_mo_c_Log2                -0.1088122852
## hs_pb_c_Log2                -0.0217220451
## hs_tl_cdich_None             .           
## hs_dde_cadj_Log2            -0.0420566432
## hs_ddt_cadj_Log2             .           
## hs_hcb_cadj_Log2             .           
## hs_pcb118_cadj_Log2          .           
## hs_pcb138_cadj_Log2          .           
## hs_pcb153_cadj_Log2         -0.1717111807
## hs_pcb170_cadj_Log2         -0.0597612863
## hs_pcb180_cadj_Log2          .           
## hs_dep_cadj_Log2            -0.0212245496
## hs_detp_cadj_Log2            0.0001951293
## hs_dmdtp_cdich_None          .           
## hs_dmp_cadj_Log2             .           
## hs_dmtp_cadj_Log2            .           
## hs_pbde153_cadj_Log2        -0.0361530310
## hs_pbde47_cadj_Log2          .           
## hs_pfhxs_c_Log2             -0.0102171741
## hs_pfna_c_Log2               .           
## hs_pfoa_c_Log2              -0.1415449389
## hs_pfos_c_Log2              -0.0486025276
## hs_pfunda_c_Log2             .           
## hs_bpa_cadj_Log2             .           
## hs_bupa_cadj_Log2            .           
## hs_etpa_cadj_Log2            .           
## hs_mepa_cadj_Log2           -0.0027645744
## hs_oxbe_cadj_Log2            0.0060056008
## hs_prpa_cadj_Log2            0.0039341981
## hs_trcs_cadj_Log2            .           
## hs_mbzp_cadj_Log2            0.0499456100
## hs_mecpp_cadj_Log2           .           
## hs_mehhp_cadj_Log2           .           
## hs_mehp_cadj_Log2            .           
## hs_meohp_cadj_Log2           .           
## hs_mep_cadj_Log2             .           
## hs_mibp_cadj_Log2           -0.0559548066
## hs_mnbp_cadj_Log2           -0.0134892925
## hs_ohminp_cadj_Log2          .           
## hs_oxominp_cadj_Log2         .           
## FAS_cat_None                 .           
## hs_contactfam_3cat_num_None  .           
## hs_hm_pers_None             -0.0161154323
## hs_participation_3cat_None   .           
## hs_cotinine_cdich_None       .           
## hs_globalexp2_None           .           
## hs_smk_parents_None          .
cat("Model with Covariates - Test MSE:", test_mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.185848
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.22833

Predicting Ridge

# RIDGE
fit_with_covariates_train <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian")
fit_with_covariates_test <- predict(fit_with_covariates_train, s = "lambda.min", newx = x_test)
test_mse_with_covariates <- mean((y_test - fit_with_covariates_test)^2)

x_train_chemicals_only <- as.matrix(selected_data[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(selected_data[test_indices, chemicals_full])

fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

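# coefficient paths come from the underlying glmnet fit; plotting the cv.glmnet object itself draws the CV error curve
plot(fit_with_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (With Covariates)")

plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")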

best_lambda <- fit_with_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates_train, s = best_lambda)  # coefficients at the chosen lambda
## 82 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                 -42.160955638
## hs_child_age_None            -0.024653654
## h_cohort                      0.056124306
## e3_sex_None                   .          
## e3_yearbir_None               0.018777525
## h_edumc_None                  0.016255702
## h_native_None                 0.071773042
## hs_as_c_Log2                  0.006364945
## hs_cd_c_Log2                 -0.042733563
## hs_co_c_Log2                 -0.062828135
## hs_cs_c_Log2                  0.133399431
## hs_cu_c_Log2                  0.609965500
## hs_hg_c_Log2                 -0.023639135
## hs_mn_c_Log2                 -0.024596407
## hs_mo_c_Log2                 -0.115214282
## hs_pb_c_Log2                 -0.026511118
## hs_tl_cdich_None              .          
## hs_dde_cadj_Log2             -0.065049721
## hs_ddt_cadj_Log2              0.001881727
## hs_hcb_cadj_Log2             -0.032088608
## hs_pcb118_cadj_Log2           0.026419517
## hs_pcb138_cadj_Log2          -0.039147083
## hs_pcb153_cadj_Log2          -0.129106673
## hs_pcb170_cadj_Log2          -0.051481423
## hs_pcb180_cadj_Log2          -0.011929531
## hs_dep_cadj_Log2             -0.024659540
## hs_detp_cadj_Log2             0.006320785
## hs_dmdtp_cdich_None           .          
## hs_dmp_cadj_Log2             -0.002412305
## hs_dmtp_cadj_Log2             0.001052736
## hs_pbde153_cadj_Log2         -0.031205814
## hs_pbde47_cadj_Log2           0.009911956
## hs_pfhxs_c_Log2              -0.005219457
## hs_pfna_c_Log2                0.005003013
## hs_pfoa_c_Log2               -0.135134818
## hs_pfos_c_Log2               -0.072775595
## hs_pfunda_c_Log2              0.010645653
## hs_bpa_cadj_Log2             -0.004906553
## hs_bupa_cadj_Log2             0.004399280
## hs_etpa_cadj_Log2            -0.006127763
## hs_mepa_cadj_Log2            -0.015191889
## hs_oxbe_cadj_Log2             0.009931284
## hs_prpa_cadj_Log2             0.013906415
## hs_trcs_cadj_Log2             0.011135502
## hs_mbzp_cadj_Log2             0.054325547
## hs_mecpp_cadj_Log2           -0.009161963
## hs_mehhp_cadj_Log2            0.012553175
## hs_mehp_cadj_Log2            -0.014601711
## hs_meohp_cadj_Log2            0.003810181
## hs_mep_cadj_Log2              0.014463820
## hs_mibp_cadj_Log2            -0.042581695
## hs_mnbp_cadj_Log2            -0.052773051
## hs_ohminp_cadj_Log2          -0.024175703
## hs_oxominp_cadj_Log2          0.020047912
## FAS_cat_None                  .          
## hs_contactfam_3cat_num_None   .          
## hs_hm_pers_None              -0.023634190
## hs_participation_3cat_None    .          
## hs_cotinine_cdich_None        .          
## hs_globalexp2_None            .          
## hs_smk_parents_None           .          
## h_bfdur_Ter                   .          
## hs_bakery_prod_Ter            .          
## hs_beverages_Ter              .          
## hs_break_cer_Ter              .          
## hs_caff_drink_Ter             .          
## hs_dairy_Ter                  .          
## hs_fastfood_Ter               .          
## h_legume_preg_Ter             .          
## hs_org_food_Ter               .          
## hs_proc_meat_Ter              .          
## hs_readymade_Ter              .          
## hs_total_bread_Ter            .          
## hs_total_cereal_Ter           .          
## hs_total_fish_Ter             .          
## hs_total_fruits_Ter           .          
## hs_total_lipids_Ter           .          
## hs_total_meat_Ter             .          
## hs_total_potatoes_Ter         .          
## hs_total_sweets_Ter           .          
## hs_total_veg_Ter              .          
## hs_total_yog_Ter              .
best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 55 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                 -4.239624e+00
## hs_as_c_Log2                 6.052451e-03
## hs_cd_c_Log2                -3.886611e-02
## hs_co_c_Log2                -4.854412e-02
## hs_cs_c_Log2                 1.151068e-01
## hs_cu_c_Log2                 5.959543e-01
## hs_hg_c_Log2                -3.122837e-02
## hs_mn_c_Log2                -3.082238e-02
## hs_mo_c_Log2                -1.046374e-01
## hs_pb_c_Log2                -4.878458e-02
## hs_tl_cdich_None             .           
## hs_dde_cadj_Log2            -4.719029e-02
## hs_ddt_cadj_Log2             3.820665e-03
## hs_hcb_cadj_Log2            -1.971507e-02
## hs_pcb118_cadj_Log2          1.201408e-02
## hs_pcb138_cadj_Log2         -3.868824e-02
## hs_pcb153_cadj_Log2         -1.193205e-01
## hs_pcb170_cadj_Log2         -5.095402e-02
## hs_pcb180_cadj_Log2         -1.203014e-02
## hs_dep_cadj_Log2            -2.456945e-02
## hs_detp_cadj_Log2            7.811685e-03
## hs_dmdtp_cdich_None          .           
## hs_dmp_cadj_Log2            -2.075000e-03
## hs_dmtp_cadj_Log2            2.511909e-04
## hs_pbde153_cadj_Log2        -3.225764e-02
## hs_pbde47_cadj_Log2          5.263678e-03
## hs_pfhxs_c_Log2             -3.100931e-02
## hs_pfna_c_Log2               2.070402e-02
## hs_pfoa_c_Log2              -1.487317e-01
## hs_pfos_c_Log2              -6.238062e-02
## hs_pfunda_c_Log2             1.141304e-02
## hs_bpa_cadj_Log2            -9.692717e-05
## hs_bupa_cadj_Log2            6.208731e-03
## hs_etpa_cadj_Log2           -6.434128e-03
## hs_mepa_cadj_Log2           -1.573173e-02
## hs_oxbe_cadj_Log2            1.308651e-02
## hs_prpa_cadj_Log2            1.226900e-02
## hs_trcs_cadj_Log2            2.731754e-03
## hs_mbzp_cadj_Log2            5.356129e-02
## hs_mecpp_cadj_Log2           2.546898e-03
## hs_mehhp_cadj_Log2           1.984704e-02
## hs_mehp_cadj_Log2           -1.470314e-02
## hs_meohp_cadj_Log2           1.126869e-02
## hs_mep_cadj_Log2             3.543863e-03
## hs_mibp_cadj_Log2           -5.228014e-02
## hs_mnbp_cadj_Log2           -4.190152e-02
## hs_ohminp_cadj_Log2         -2.737398e-02
## hs_oxominp_cadj_Log2         2.144488e-02
## FAS_cat_None                 .           
## hs_contactfam_3cat_num_None  .           
## hs_hm_pers_None             -3.221832e-02
## hs_participation_3cat_None   .           
## hs_cotinine_cdich_None       .           
## hs_globalexp2_None           .           
## hs_smk_parents_None          .
cat("Model with Covariates - Test MSE:", test_mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.145497
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.186358

Predicting Elastic Net

# ELASTIC NET
fit_with_covariates_train <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian")
fit_with_covariates_test <- predict(fit_with_covariates_train, s = "lambda.min", newx = x_test)
test_mse_with_covariates <- mean((y_test - fit_with_covariates_test)^2)

x_train_chemicals_only <- as.matrix(selected_data[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(selected_data[test_indices, chemicals_full])

fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
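The lasso, ridge, and elastic net fits in this report differ only in the `alpha` argument (1, 0, and 0.5). As documented for glmnet, `alpha` mixes the two penalty terms as lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))). The sketch below evaluates that expression in base R for one illustrative coefficient vector; `enet_penalty`, `beta`, and the value of `lambda` are assumptions made for the example and are not part of the analysis.

```r
# Hypothetical helper (not part of glmnet): evaluates the elastic net penalty
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

# Illustrative coefficients (intercept excluded, since it is not penalized)
beta <- c(0.5, -0.25, 0)

penalties <- c(
  ridge       = enet_penalty(beta, lambda = 0.1, alpha = 0),
  elastic_net = enet_penalty(beta, lambda = 0.1, alpha = 0.5),
  lasso       = enet_penalty(beta, lambda = 0.1, alpha = 1)
)
print(penalties)
```

With `alpha = 0.5` the penalty is the average of the ridge and lasso penalties, which is consistent with the elastic net coefficients in this report staying sparse like the lasso ones.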

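# coefficient paths come from the underlying glmnet fit; plotting the cv.glmnet object itself draws the CV error curve
plot(fit_with_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (With Covariates)")

plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")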

best_lambda <- fit_with_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates_train, s = best_lambda)  # coefficients at the chosen lambda
## 82 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                 -5.3993079033
## hs_child_age_None            .           
## h_cohort                     0.0655211032
## e3_sex_None                  .           
## e3_yearbir_None              .           
## h_edumc_None                 .           
## h_native_None                0.0520046931
## hs_as_c_Log2                 .           
## hs_cd_c_Log2                -0.0329314231
## hs_co_c_Log2                -0.0224384954
## hs_cs_c_Log2                 0.0793739121
## hs_cu_c_Log2                 0.6441104526
## hs_hg_c_Log2                -0.0018698300
## hs_mn_c_Log2                 .           
## hs_mo_c_Log2                -0.1118336023
## hs_pb_c_Log2                 .           
## hs_tl_cdich_None             .           
## hs_dde_cadj_Log2            -0.0632356026
## hs_ddt_cadj_Log2             .           
## hs_hcb_cadj_Log2             .           
## hs_pcb118_cadj_Log2          .           
## hs_pcb138_cadj_Log2          .           
## hs_pcb153_cadj_Log2         -0.1833493206
## hs_pcb170_cadj_Log2         -0.0562600632
## hs_pcb180_cadj_Log2          .           
## hs_dep_cadj_Log2            -0.0188690919
## hs_detp_cadj_Log2            .           
## hs_dmdtp_cdich_None          .           
## hs_dmp_cadj_Log2             .           
## hs_dmtp_cadj_Log2            .           
## hs_pbde153_cadj_Log2        -0.0326246963
## hs_pbde47_cadj_Log2          .           
## hs_pfhxs_c_Log2              .           
## hs_pfna_c_Log2               .           
## hs_pfoa_c_Log2              -0.1022981441
## hs_pfos_c_Log2              -0.0730756063
## hs_pfunda_c_Log2             .           
## hs_bpa_cadj_Log2             .           
## hs_bupa_cadj_Log2            .           
## hs_etpa_cadj_Log2            .           
## hs_mepa_cadj_Log2            .           
## hs_oxbe_cadj_Log2            0.0007162562
## hs_prpa_cadj_Log2            0.0056296363
## hs_trcs_cadj_Log2            .           
## hs_mbzp_cadj_Log2            0.0335195179
## hs_mecpp_cadj_Log2           .           
## hs_mehhp_cadj_Log2           .           
## hs_mehp_cadj_Log2            .           
## hs_meohp_cadj_Log2           .           
## hs_mep_cadj_Log2             .           
## hs_mibp_cadj_Log2           -0.0241809131
## hs_mnbp_cadj_Log2           -0.0236651281
## hs_ohminp_cadj_Log2          .           
## hs_oxominp_cadj_Log2         .           
## FAS_cat_None                 .           
## hs_contactfam_3cat_num_None  .           
## hs_hm_pers_None             -0.0031390265
## hs_participation_3cat_None   .           
## hs_cotinine_cdich_None       .           
## hs_globalexp2_None           .           
## hs_smk_parents_None          .           
## h_bfdur_Ter                  .           
## hs_bakery_prod_Ter           .           
## hs_beverages_Ter             .           
## hs_break_cer_Ter             .           
## hs_caff_drink_Ter            .           
## hs_dairy_Ter                 .           
## hs_fastfood_Ter              .           
## h_legume_preg_Ter            .           
## hs_org_food_Ter              .           
## hs_proc_meat_Ter             .           
## hs_readymade_Ter             .           
## hs_total_bread_Ter           .           
## hs_total_cereal_Ter          .           
## hs_total_fish_Ter            .           
## hs_total_fruits_Ter          .           
## hs_total_lipids_Ter          .           
## hs_total_meat_Ter            .           
## hs_total_potatoes_Ter        .           
## hs_total_sweets_Ter          .           
## hs_total_veg_Ter             .           
## hs_total_yog_Ter             .
best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 55 x 1 sparse Matrix of class "dgCMatrix"
##                                       s1
## (Intercept)                 -5.165056742
## hs_as_c_Log2                 .          
## hs_cd_c_Log2                -0.029587674
## hs_co_c_Log2                -0.018234610
## hs_cs_c_Log2                 0.105479459
## hs_cu_c_Log2                 0.660783281
## hs_hg_c_Log2                -0.018407443
## hs_mn_c_Log2                 .          
## hs_mo_c_Log2                -0.108921552
## hs_pb_c_Log2                -0.024687445
## hs_tl_cdich_None             .          
## hs_dde_cadj_Log2            -0.043043336
## hs_ddt_cadj_Log2             .          
## hs_hcb_cadj_Log2             .          
## hs_pcb118_cadj_Log2          .          
## hs_pcb138_cadj_Log2          .          
## hs_pcb153_cadj_Log2         -0.169983885
## hs_pcb170_cadj_Log2         -0.059558355
## hs_pcb180_cadj_Log2          .          
## hs_dep_cadj_Log2            -0.021814037
## hs_detp_cadj_Log2            0.001177980
## hs_dmdtp_cdich_None          .          
## hs_dmp_cadj_Log2             .          
## hs_dmtp_cadj_Log2            .          
## hs_pbde153_cadj_Log2        -0.035870956
## hs_pbde47_cadj_Log2          .          
## hs_pfhxs_c_Log2             -0.013483975
## hs_pfna_c_Log2               .          
## hs_pfoa_c_Log2              -0.141668953
## hs_pfos_c_Log2              -0.048925424
## hs_pfunda_c_Log2             .          
## hs_bpa_cadj_Log2             .          
## hs_bupa_cadj_Log2            .          
## hs_etpa_cadj_Log2            .          
## hs_mepa_cadj_Log2           -0.005175616
## hs_oxbe_cadj_Log2            0.007141167
## hs_prpa_cadj_Log2            0.005237280
## hs_trcs_cadj_Log2            .          
## hs_mbzp_cadj_Log2            0.051380905
## hs_mecpp_cadj_Log2           .          
## hs_mehhp_cadj_Log2           .          
## hs_mehp_cadj_Log2            .          
## hs_meohp_cadj_Log2           .          
## hs_mep_cadj_Log2             .          
## hs_mibp_cadj_Log2           -0.055499484
## hs_mnbp_cadj_Log2           -0.016778105
## hs_ohminp_cadj_Log2          .          
## hs_oxominp_cadj_Log2         .          
## FAS_cat_None                 .          
## hs_contactfam_3cat_num_None  .          
## hs_hm_pers_None             -0.018050249
## hs_participation_3cat_None   .          
## hs_cotinine_cdich_None       .          
## hs_globalexp2_None           .          
## hs_smk_parents_None          .
cat("Model with Covariates - Test MSE:", test_mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.185552
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.225393
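The six test MSEs printed for the chemical models can be put side by side. The sketch below copies the values from the output above; it assumes the first pair (1.185848 and 1.22833, printed just before the ridge section) belongs to the lasso fit. The names `mse_table` and `best_model` are illustrative.

```r
# Test MSEs copied from the printed output (lasso labelling is an assumption)
mse_table <- data.frame(
  model              = c("lasso", "ridge", "elastic_net"),
  with_covariates    = c(1.185848, 1.145497, 1.185552),
  without_covariates = c(1.22833, 1.186358, 1.225393)
)

# Reduction in test MSE when the covariates are added to the chemical exposures
mse_table$difference <- mse_table$without_covariates - mse_table$with_covariates
print(mse_table)

# Model with the lowest test MSE among the fits that include covariates
best_model <- mse_table$model[which.min(mse_table$with_covariates)]
cat("Lowest test MSE with covariates:", best_model, "\n")
```

On this single 70/30 split, ridge has the lowest test MSE in both columns, and adding the covariates lowers the test MSE by about 0.04 for every penalty. These differences are small, so a different random split could plausibly change the ordering.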

Postnatal Diet Data

diet_data <- selected_data[, postnatal_diet]

x_diet <- model.matrix(~ . + 0, data = diet_data)  # adding 0 omits the intercept

covariates <- selected_data[, c("e3_sex_None", "e3_yearbir_None", "h_edumc_None", "h_cohort", "hs_child_age_None")]

x_covariates <- model.matrix(~ . + 0, data = covariates)

x_full <- cbind(x_diet, x_covariates)
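Because the formula `~ . + 0` removes the intercept, `model.matrix` gives the first factor in each data frame a column for every level, while later factors drop their reference level. That is why the coefficient tables further down list three `h_bfdur_Ter` columns and both `e3_sex_None` columns, but only two columns for the other tertile variables. A minimal base-R sketch on a toy data frame (the names `toy`, `f1`, and `f2` are made up for illustration):

```r
# Toy data frame with two factors (illustrative only)
toy <- data.frame(
  f1 = factor(c("a", "b", "c", "a")),
  f2 = factor(c("x", "y", "x", "y"))
)

# Same formula as used for the diet and covariate matrices
x_toy <- model.matrix(~ . + 0, data = toy)

# f1 keeps all three levels; f2 loses its reference level "x"
print(colnames(x_toy))
```

For the later factors, each coefficient is therefore read relative to the omitted lowest category.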

# replace any missing values with 0 so that glmnet receives complete matrices
x_full[is.na(x_full)] <- 0
x_diet[is.na(x_diet)] <- 0

y <- as.numeric(selected_data$hs_zbmi_who)

# fit model with postnatal diet and with or without covariates
fit_with_covariates <- cv.glmnet(x_full, y, alpha = 1, family = "gaussian")
fit_without_covariates <- cv.glmnet(x_diet, y, alpha = 1, family = "gaussian")

plot(fit_with_covariates)

plot(fit_without_covariates)

cat("Model with Covariates - Lambda Min:", fit_with_covariates$lambda.min, "\n")
## Model with Covariates - Lambda Min: 0.01923849
cat("Model without Covariates - Lambda Min:", fit_without_covariates$lambda.min, "\n")
## Model without Covariates - Lambda Min: 0.01519071

Predicting Lasso

# LASSO with train/test
set.seed(101)  
train_indices <- sample(seq_len(nrow(selected_data)), size = floor(0.7 * nrow(selected_data)))
test_indices <- setdiff(seq_len(nrow(selected_data)), train_indices)

diet_data <- selected_data[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])  
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])  

covariates <- selected_data[, c("e3_sex_None", "e3_yearbir_None", "h_edumc_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ]) 
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)

x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0

y_train <- as.numeric(selected_data$hs_zbmi_who[train_indices])
y_test <- as.numeric(selected_data$hs_zbmi_who[test_indices])

# fit models
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 1, family = "gaussian")
fit_with_covariates
## 
## Call:  cv.glmnet(x = x_full_train, y = y_train, alpha = 1, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.04164    17   1.404 0.06582      17
## 1se 0.18447     1   1.440 0.06226       0
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 1, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 1, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.03609    16   1.423 0.08753      14
## 1se 0.14570     1   1.440 0.08232       0
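The `min` and `1se` rows above follow the usual cross-validation convention: `lambda.min` minimizes the mean CV error, and `lambda.1se` is the largest lambda whose mean CV error lies within one standard error of that minimum. The base-R sketch below mimics the rule on made-up numbers; `pick_lambdas` is a hypothetical helper, not a glmnet function, and the vectors are illustrative rather than taken from the fits.

```r
# Hypothetical helper mimicking how lambda.min and lambda.1se are chosen
pick_lambdas <- function(lambda, cvm, cvsd) {
  i_min <- which.min(cvm)                  # index of the smallest mean CV error
  threshold <- cvm[i_min] + cvsd[i_min]    # minimum plus one standard error
  list(
    lambda_min = lambda[i_min],
    lambda_1se = max(lambda[cvm <= threshold])
  )
}

# Illustrative CV curve: a flat error profile, similar in shape to the one above
lambda <- c(0.20, 0.10, 0.05, 0.02, 0.01)
cvm    <- c(1.44, 1.42, 1.40, 1.41, 1.43)
cvsd   <- c(0.06, 0.06, 0.06, 0.06, 0.06)

chosen <- pick_lambdas(lambda, cvm, cvsd)
print(chosen)
```

When the CV curve is this flat, the one-standard-error rule selects the largest lambda on the path. The same happens in the output above, where the `1se` row sits at index 1 with zero nonzero coefficients, which suggests the diet variables add little predictive signal beyond the intercept.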
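# coefficient paths come from the underlying glmnet fit; plotting the cv.glmnet object itself draws the CV error curve
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")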

best_lambda <- fit_with_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates, s = best_lambda)  # coefficients at the chosen lambda
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                     0.5648339722
## h_bfdur_Ter(0,10.8]             .           
## h_bfdur_Ter(10.8,34.9]          .           
## h_bfdur_Ter(34.9,Inf]           .           
## hs_bakery_prod_Ter(2,6]         .           
## hs_bakery_prod_Ter(6,Inf]      -0.1133584362
## hs_beverages_Ter(0.132,1]       .           
## hs_beverages_Ter(1,Inf]         .           
## hs_break_cer_Ter(1.1,5.5]       .           
## hs_break_cer_Ter(5.5,Inf]      -0.0316932102
## hs_caff_drink_Ter(0.132,Inf]    .           
## hs_dairy_Ter(14.6,25.6]         0.0088653749
## hs_dairy_Ter(25.6,Inf]          .           
## hs_fastfood_Ter(0.132,0.5]      .           
## hs_fastfood_Ter(0.5,Inf]        .           
## h_legume_preg_Ter(0.5,2]        .           
## h_legume_preg_Ter(2,Inf]        .           
## hs_org_food_Ter(0.132,1]        .           
## hs_org_food_Ter(1,Inf]         -0.1213297545
## hs_proc_meat_Ter(1.5,4]         .           
## hs_proc_meat_Ter(4,Inf]         .           
## hs_readymade_Ter(0.132,0.5]     .           
## hs_readymade_Ter(0.5,Inf]       .           
## hs_total_bread_Ter(7,17.5]      .           
## hs_total_bread_Ter(17.5,Inf]    .           
## hs_total_cereal_Ter(14.1,23.6]  .           
## hs_total_cereal_Ter(23.6,Inf]   .           
## hs_total_fish_Ter(1.5,3]        .           
## hs_total_fish_Ter(3,Inf]        .           
## hs_total_fruits_Ter(7,14.1]     0.0002333088
## hs_total_fruits_Ter(14.1,Inf]  -0.0126943442
## hs_total_lipids_Ter(3,7]        .           
## hs_total_lipids_Ter(7,Inf]     -0.0063934032
## hs_total_meat_Ter(6,9]          .           
## hs_total_meat_Ter(9,Inf]        .           
## hs_total_potatoes_Ter(3,4]      .           
## hs_total_potatoes_Ter(4,Inf]    .           
## hs_total_sweets_Ter(4.1,8.5]   -0.0749964702
## hs_total_sweets_Ter(8.5,Inf]    .           
## hs_total_veg_Ter(6,8.5]         .           
## hs_total_veg_Ter(8.5,Inf]      -0.0598834839
## hs_total_yog_Ter(6,8.5]         .           
## hs_total_yog_Ter(8.5,Inf]       .           
## e3_sex_Nonefemale              -0.0456861174
## e3_sex_Nonemale                 .           
## e3_yearbir_None2004            -0.0438776088
## e3_yearbir_None2005             .           
## e3_yearbir_None2006             .           
## e3_yearbir_None2007             .           
## e3_yearbir_None2008             .           
## e3_yearbir_None2009             .           
## h_edumc_None2                   0.0018813738
## h_edumc_None3                  -0.0724241304
## h_cohort2                       .           
## h_cohort3                       0.3406608342
## h_cohort4                       0.1018684574
## h_cohort5                      -0.1592809829
## h_cohort6                       0.1985783490
## hs_child_age_None               .
best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 43 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                     0.636340202
## h_bfdur_Ter(0,10.8]             .          
## h_bfdur_Ter(10.8,34.9]          0.023906468
## h_bfdur_Ter(34.9,Inf]           .          
## hs_bakery_prod_Ter(2,6]         .          
## hs_bakery_prod_Ter(6,Inf]      -0.056385462
## hs_beverages_Ter(0.132,1]       .          
## hs_beverages_Ter(1,Inf]         .          
## hs_break_cer_Ter(1.1,5.5]       .          
## hs_break_cer_Ter(5.5,Inf]      -0.055371885
## hs_caff_drink_Ter(0.132,Inf]    .          
## hs_dairy_Ter(14.6,25.6]         0.046097840
## hs_dairy_Ter(25.6,Inf]          .          
## hs_fastfood_Ter(0.132,0.5]      .          
## hs_fastfood_Ter(0.5,Inf]        .          
## h_legume_preg_Ter(0.5,2]        0.122896858
## h_legume_preg_Ter(2,Inf]        .          
## hs_org_food_Ter(0.132,1]        .          
## hs_org_food_Ter(1,Inf]         -0.184946634
## hs_proc_meat_Ter(1.5,4]         0.005012152
## hs_proc_meat_Ter(4,Inf]        -0.007381914
## hs_readymade_Ter(0.132,0.5]     .          
## hs_readymade_Ter(0.5,Inf]       .          
## hs_total_bread_Ter(7,17.5]      .          
## hs_total_bread_Ter(17.5,Inf]    .          
## hs_total_cereal_Ter(14.1,23.6]  .          
## hs_total_cereal_Ter(23.6,Inf]   .          
## hs_total_fish_Ter(1.5,3]       -0.056227911
## hs_total_fish_Ter(3,Inf]        .          
## hs_total_fruits_Ter(7,14.1]     0.009755541
## hs_total_fruits_Ter(14.1,Inf]  -0.053778743
## hs_total_lipids_Ter(3,7]        .          
## hs_total_lipids_Ter(7,Inf]     -0.081293095
## hs_total_meat_Ter(6,9]          .          
## hs_total_meat_Ter(9,Inf]        .          
## hs_total_potatoes_Ter(3,4]      .          
## hs_total_potatoes_Ter(4,Inf]    .          
## hs_total_sweets_Ter(4.1,8.5]   -0.098908669
## hs_total_sweets_Ter(8.5,Inf]    .          
## hs_total_veg_Ter(6,8.5]         .          
## hs_total_veg_Ter(8.5,Inf]      -0.118700721
## hs_total_yog_Ter(6,8.5]         .          
## hs_total_yog_Ter(8.5,Inf]       .
predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.290294
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.339523
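A test MSE is easier to interpret next to an intercept-only baseline, which predicts the training mean of the outcome for every test child. The sketch below shows that comparison on simulated stand-ins, since `y_train` and `y_test` are not reproduced here; `test_mse`, `y_train_toy`, and `y_test_toy` are hypothetical names, and the printed number says nothing about the real data.

```r
# Hypothetical helper: mean squared error between observed and predicted values
test_mse <- function(observed, predicted) {
  mean((observed - as.numeric(predicted))^2)
}

# Simulated stand-ins for y_train and y_test (illustrative only)
set.seed(101)
y_train_toy <- rnorm(70)
y_test_toy  <- rnorm(30)

# Intercept-only baseline: predict the training mean for every test row
baseline_pred <- rep(mean(y_train_toy), length(y_test_toy))
baseline_mse  <- test_mse(y_test_toy, baseline_pred)

cat("Intercept-only baseline - Test MSE:", baseline_mse, "\n")
```

In the actual analysis the same two lines could be applied to `y_train` and `y_test`, and the result compared with the test MSEs reported above to see how much the diet variables improve on predicting the mean.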

Predicting Ridge

# RIDGE
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 0, family = "gaussian")
fit_with_covariates
## 
## Call:  cv.glmnet(x = x_full_train, y = y_train, alpha = 0, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min    2.8    46   1.411 0.05847      58
## 1se  184.5     1   1.443 0.05868      58
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min   2.67    44   1.428 0.09579      42
## 1se 145.70     1   1.442 0.09676      42
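# coefficient paths come from the underlying glmnet fit; plotting the cv.glmnet object itself draws the CV error curve
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")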

best_lambda <- fit_with_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates, s = best_lambda)  # coefficients at the chosen lambda
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                     5.284758e-01
## h_bfdur_Ter(0,10.8]            -1.099740e-02
## h_bfdur_Ter(10.8,34.9]          2.348192e-02
## h_bfdur_Ter(34.9,Inf]          -5.818857e-03
## hs_bakery_prod_Ter(2,6]         1.787112e-02
## hs_bakery_prod_Ter(6,Inf]      -4.243477e-02
## hs_beverages_Ter(0.132,1]      -5.906794e-03
## hs_beverages_Ter(1,Inf]         6.641315e-04
## hs_break_cer_Ter(1.1,5.5]      -6.605575e-03
## hs_break_cer_Ter(5.5,Inf]      -3.912608e-02
## hs_caff_drink_Ter(0.132,Inf]   -8.278957e-03
## hs_dairy_Ter(14.6,25.6]         3.591020e-02
## hs_dairy_Ter(25.6,Inf]          6.424051e-03
## hs_fastfood_Ter(0.132,0.5]      2.178827e-02
## hs_fastfood_Ter(0.5,Inf]       -4.269624e-03
## h_legume_preg_Ter(0.5,2]        4.974182e-02
## h_legume_preg_Ter(2,Inf]        1.585245e-03
## hs_org_food_Ter(0.132,1]        2.509655e-02
## hs_org_food_Ter(1,Inf]         -6.642514e-02
## hs_proc_meat_Ter(1.5,4]         2.290516e-02
## hs_proc_meat_Ter(4,Inf]        -1.869895e-02
## hs_readymade_Ter(0.132,0.5]    -3.529727e-03
## hs_readymade_Ter(0.5,Inf]       2.083586e-02
## hs_total_bread_Ter(7,17.5]     -1.349903e-02
## hs_total_bread_Ter(17.5,Inf]    3.056733e-03
## hs_total_cereal_Ter(14.1,23.6]  7.759686e-03
## hs_total_cereal_Ter(23.6,Inf]  -7.029716e-03
## hs_total_fish_Ter(1.5,3]       -3.342082e-02
## hs_total_fish_Ter(3,Inf]       -6.005614e-03
## hs_total_fruits_Ter(7,14.1]     2.803696e-02
## hs_total_fruits_Ter(14.1,Inf]  -3.681803e-02
## hs_total_lipids_Ter(3,7]        4.843535e-03
## hs_total_lipids_Ter(7,Inf]     -4.016998e-02
## hs_total_meat_Ter(6,9]          5.159825e-04
## hs_total_meat_Ter(9,Inf]        3.464491e-05
## hs_total_potatoes_Ter(3,4]      1.513769e-02
## hs_total_potatoes_Ter(4,Inf]   -3.104207e-03
## hs_total_sweets_Ter(4.1,8.5]   -4.691703e-02
## hs_total_sweets_Ter(8.5,Inf]    3.756444e-03
## hs_total_veg_Ter(6,8.5]        -1.722606e-03
## hs_total_veg_Ter(8.5,Inf]      -5.121521e-02
## hs_total_yog_Ter(6,8.5]        -7.640896e-03
## hs_total_yog_Ter(8.5,Inf]      -9.306272e-03
## e3_sex_Nonefemale              -3.002022e-02
## e3_sex_Nonemale                 3.001873e-02
## e3_yearbir_None2004            -6.037200e-02
## e3_yearbir_None2005             2.739531e-02
## e3_yearbir_None2006            -2.330959e-02
## e3_yearbir_None2007            -5.212367e-03
## e3_yearbir_None2008             2.717594e-02
## e3_yearbir_None2009            -3.054503e-02
## h_edumc_None2                   4.119509e-02
## h_edumc_None3                  -5.287052e-02
## h_cohort2                      -3.451696e-02
## h_cohort3                       1.160360e-01
## h_cohort4                       3.965935e-02
## h_cohort5                      -9.174135e-02
## h_cohort6                       5.828852e-02
## hs_child_age_None              -4.847617e-03
best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 43 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                     5.124985e-01
## h_bfdur_Ter(0,10.8]            -1.358118e-02
## h_bfdur_Ter(10.8,34.9]          3.774674e-02
## h_bfdur_Ter(34.9,Inf]          -1.340757e-02
## hs_bakery_prod_Ter(2,6]         2.587561e-02
## hs_bakery_prod_Ter(6,Inf]      -3.405874e-02
## hs_beverages_Ter(0.132,1]      -9.027534e-03
## hs_beverages_Ter(1,Inf]         5.374768e-05
## hs_break_cer_Ter(1.1,5.5]      -6.059998e-03
## hs_break_cer_Ter(5.5,Inf]      -4.124579e-02
## hs_caff_drink_Ter(0.132,Inf]   -1.677759e-02
## hs_dairy_Ter(14.6,25.6]         4.144777e-02
## hs_dairy_Ter(25.6,Inf]          1.323447e-03
## hs_fastfood_Ter(0.132,0.5]      2.022363e-02
## hs_fastfood_Ter(0.5,Inf]        5.573297e-04
## h_legume_preg_Ter(0.5,2]        7.276374e-02
## h_legume_preg_Ter(2,Inf]        1.171750e-02
## hs_org_food_Ter(0.132,1]        1.744858e-02
## hs_org_food_Ter(1,Inf]         -8.024588e-02
## hs_proc_meat_Ter(1.5,4]         2.556512e-02
## hs_proc_meat_Ter(4,Inf]        -2.290499e-02
## hs_readymade_Ter(0.132,0.5]    -1.518843e-03
## hs_readymade_Ter(0.5,Inf]       1.336285e-02
## hs_total_bread_Ter(7,17.5]     -5.426895e-03
## hs_total_bread_Ter(17.5,Inf]   -7.750245e-03
## hs_total_cereal_Ter(14.1,23.6]  1.018806e-02
## hs_total_cereal_Ter(23.6,Inf]  -1.449850e-02
## hs_total_fish_Ter(1.5,3]       -4.277028e-02
## hs_total_fish_Ter(3,Inf]       -7.793985e-03
## hs_total_fruits_Ter(7,14.1]     3.021019e-02
## hs_total_fruits_Ter(14.1,Inf]  -4.358446e-02
## hs_total_lipids_Ter(3,7]       -2.401764e-03
## hs_total_lipids_Ter(7,Inf]     -5.319258e-02
## hs_total_meat_Ter(6,9]          2.804418e-04
## hs_total_meat_Ter(9,Inf]        2.156405e-03
## hs_total_potatoes_Ter(3,4]      1.359808e-02
## hs_total_potatoes_Ter(4,Inf]    6.311069e-03
## hs_total_sweets_Ter(4.1,8.5]   -4.902698e-02
## hs_total_sweets_Ter(8.5,Inf]    3.982619e-04
## hs_total_veg_Ter(6,8.5]        -1.589070e-03
## hs_total_veg_Ter(8.5,Inf]      -6.559892e-02
## hs_total_yog_Ter(6,8.5]        -1.131931e-02
## hs_total_yog_Ter(8.5,Inf]      -1.062038e-02
predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.293595
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.32074
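
A test MSE is easier to judge next to an intercept-only baseline that always predicts the training mean. The sketch below shows one way to compute that comparison. The helper name `baseline_comparison` and the `*_demo` objects are hypothetical, and the values are simulated rather than taken from the study data.

```r
# Hypothetical helper (not part of the original analysis): compare a model's
# test MSE with an intercept-only baseline that predicts the training mean.
baseline_comparison <- function(y_train, y_test, predictions) {
  # predict() on a cv.glmnet object returns a one-column matrix
  predictions <- as.numeric(predictions)
  mse_model <- mean((y_test - predictions)^2)
  mse_baseline <- mean((y_test - mean(y_train))^2)
  c(mse_model = mse_model,
    mse_baseline = mse_baseline,
    r2_test = 1 - mse_model / mse_baseline)
}

# Toy illustration with simulated values (not the study data)
set.seed(1)
y_train_demo <- rnorm(100)
y_test_demo <- rnorm(50)
pred_demo <- y_test_demo + rnorm(50, sd = 0.5)

result_demo <- baseline_comparison(y_train_demo, y_test_demo, pred_demo)
print(result_demo)
```

In the analysis itself the call would presumably be `baseline_comparison(y_train, y_test, predictions_with_covariates)`. An `r2_test` near zero would indicate that the model predicts little better than the mean.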

Predicting Elastic Net

# ELASTIC NET
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 0.5, family = "gaussian")
fit_with_covariates
## 
## Call:  cv.glmnet(x = x_full_train, y = y_train, alpha = 0.5, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min 0.0759    18   1.395 0.08667      21
## 1se 0.3689     1   1.443 0.08799       0
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.07218    16   1.423 0.04773      14
## 1se 0.29139     1   1.443 0.04721       0
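# plot() on a cv.glmnet object draws the CV error curve; the coefficient
# paths come from the underlying glmnet fit stored in $glmnet.fit
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")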

predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.288522
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.339045
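
The elastic net above fixes `alpha = 0.5`. One common alternative is to compare several `alpha` values on the same cross-validation folds, so that differences in CV error are not caused by different random splits. The following is a minimal sketch on simulated data. The objects `x_demo`, `y_demo`, and the `alphas` grid are assumptions made for illustration, not part of the original analysis.

```r
library(glmnet)

# Simulated data standing in for the real design matrix and outcome
set.seed(101)
n <- 200
p <- 20
x_demo <- matrix(rnorm(n * p), n, p)
y_demo <- x_demo[, 1] - 0.5 * x_demo[, 2] + rnorm(n)

# One fold assignment reused for every alpha so the CV errors are comparable
foldid <- sample(rep(1:10, length.out = n))
alphas <- c(0, 0.25, 0.5, 0.75, 1)

cv_error <- sapply(alphas, function(a) {
  fit <- cv.glmnet(x_demo, y_demo, alpha = a, foldid = foldid, family = "gaussian")
  min(fit$cvm)  # CV mean-squared error at lambda.min
})

alpha_results <- data.frame(alpha = alphas, cv_mse = cv_error)
print(alpha_results)
best_alpha <- alphas[which.min(cv_error)]
```

With the study data, `x_demo` and `y_demo` would be replaced by the training matrix and `y_train`. The test set should stay out of this selection step.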

Combined Data (Chemicals & Postnatal Diet)

x_chemicals <- as.matrix(selected_data[, chemicals_full])
x_diet <- model.matrix(~ . + 0, data = selected_data[, postnatal_diet])

x_covariates <- model.matrix(~ . + 0, data = selected_data[, covariate_names])

# combine all data into one full model matrix, with and without covariates
x_full_with_covariates <- cbind(x_chemicals, x_diet, x_covariates)
x_full_without_covariates <- cbind(x_chemicals, x_diet)

# replace any remaining missing values with 0
x_full_with_covariates[is.na(x_full_with_covariates)] <- 0
x_full_without_covariates[is.na(x_full_without_covariates)] <- 0

y <- as.numeric(selected_data$hs_zbmi_who)

# fit model with and without covariates
fit_with_covariates <- cv.glmnet(x_full_with_covariates, y, alpha = 1, family = "gaussian")
fit_without_covariates <- cv.glmnet(x_full_without_covariates, y, alpha = 1, family = "gaussian")

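# plot() on a cv.glmnet object draws the CV error curve; the coefficient
# paths come from the underlying glmnet fit stored in $glmnet.fit
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficients Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")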

cat("Model with Covariates - Lambda Min:", fit_with_covariates$lambda.min, "\n")
## Model with Covariates - Lambda Min: 0.02532874
cat("Model without Covariates - Lambda Min:", fit_without_covariates$lambda.min, "\n")
## Model without Covariates - Lambda Min: 0.01320643
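
The model matrices above have missing values set to 0. For the columns with a `_Log2` suffix, which appear to be log2-transformed, a value of 0 would correspond to a concentration of 1 on the original scale, which may not be a neutral choice. A hedged alternative is median imputation, sketched below with base R only. The helper `impute_median` and the toy matrix are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: fill NAs in each column with that column's median,
# computed from a reference matrix (typically the training set).
impute_median <- function(x, reference = x) {
  medians <- apply(reference, 2, median, na.rm = TRUE)
  for (j in seq_len(ncol(x))) {
    missing <- is.na(x[, j])
    x[missing, j] <- medians[j]
  }
  x
}

# Toy matrix with one missing value per column
x_demo <- cbind(a = c(1, NA, 3, 5), b = c(NA, 2, 2, 10))
x_imputed <- impute_median(x_demo)
print(x_imputed)
```

When a train/test split is used, passing the training matrix as `reference` for both sets keeps information from the test set out of the imputation.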

Predicting Lasso

set.seed(101)
train_indices <- sample(seq_len(nrow(selected_data)), size = floor(0.7 * nrow(selected_data)))
test_indices <- setdiff(seq_len(nrow(selected_data)), train_indices)

diet_data <- selected_data[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])

chemical_data <- selected_data[, chemicals_full]
x_chemical_train <- as.matrix(chemical_data[train_indices, ])
x_chemical_test <- as.matrix(chemical_data[test_indices, ])

covariates <- selected_data[, c("e3_sex_None", "e3_yearbir_None", "h_edumc_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

# combine diet and chemical data with and without covariates
x_combined_train <- cbind(x_diet_train, x_chemical_train)
x_combined_test <- cbind(x_diet_test, x_chemical_test)

x_full_train <- cbind(x_combined_train, x_covariates_train)
x_full_test <- cbind(x_combined_test, x_covariates_test)

# replace any remaining missing values with 0
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_combined_train[is.na(x_combined_train)] <- 0
x_combined_test[is.na(x_combined_test)] <- 0

y_train <- as.numeric(selected_data$hs_zbmi_who[train_indices])
y_test <- as.numeric(selected_data$hs_zbmi_who[test_indices])

# LASSO
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 1, family = "gaussian")
predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 1, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

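# plot() on a cv.glmnet object draws the CV error curve; the coefficient
# paths come from the underlying glmnet fit stored in $glmnet.fit
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")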

best_lambda <- fit_with_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates, s = best_lambda)  # coefficients at the chosen lambda
## 113 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -4.760129e+00
## h_bfdur_Ter(0,10.8]            -8.142886e-02
## h_bfdur_Ter(10.8,34.9]          .           
## h_bfdur_Ter(34.9,Inf]           3.535147e-02
## hs_bakery_prod_Ter(2,6]         .           
## hs_bakery_prod_Ter(6,Inf]      -1.905172e-01
## hs_beverages_Ter(0.132,1]       .           
## hs_beverages_Ter(1,Inf]         .           
## hs_break_cer_Ter(1.1,5.5]       .           
## hs_break_cer_Ter(5.5,Inf]       .           
## hs_caff_drink_Ter(0.132,Inf]    .           
## hs_dairy_Ter(14.6,25.6]         .           
## hs_dairy_Ter(25.6,Inf]          .           
## hs_fastfood_Ter(0.132,0.5]      4.249786e-02
## hs_fastfood_Ter(0.5,Inf]        .           
## h_legume_preg_Ter(0.5,2]        .           
## h_legume_preg_Ter(2,Inf]       -4.964957e-02
## hs_org_food_Ter(0.132,1]        .           
## hs_org_food_Ter(1,Inf]          .           
## hs_proc_meat_Ter(1.5,4]         .           
## hs_proc_meat_Ter(4,Inf]         .           
## hs_readymade_Ter(0.132,0.5]     .           
## hs_readymade_Ter(0.5,Inf]       .           
## hs_total_bread_Ter(7,17.5]     -2.257386e-02
## hs_total_bread_Ter(17.5,Inf]    .           
## hs_total_cereal_Ter(14.1,23.6]  .           
## hs_total_cereal_Ter(23.6,Inf]   .           
## hs_total_fish_Ter(1.5,3]        .           
## hs_total_fish_Ter(3,Inf]        .           
## hs_total_fruits_Ter(7,14.1]     .           
## hs_total_fruits_Ter(14.1,Inf]  -4.053583e-03
## hs_total_lipids_Ter(3,7]        .           
## hs_total_lipids_Ter(7,Inf]     -7.313663e-03
## hs_total_meat_Ter(6,9]          .           
## hs_total_meat_Ter(9,Inf]        .           
## hs_total_potatoes_Ter(3,4]      .           
## hs_total_potatoes_Ter(4,Inf]    .           
## hs_total_sweets_Ter(4.1,8.5]   -4.057094e-03
## hs_total_sweets_Ter(8.5,Inf]    .           
## hs_total_veg_Ter(6,8.5]         .           
## hs_total_veg_Ter(8.5,Inf]       .           
## hs_total_yog_Ter(6,8.5]         .           
## hs_total_yog_Ter(8.5,Inf]       .           
## hs_as_c_Log2                    .           
## hs_cd_c_Log2                   -7.112010e-03
## hs_co_c_Log2                    .           
## hs_cs_c_Log2                    1.008616e-01
## hs_cu_c_Log2                    6.321231e-01
## hs_hg_c_Log2                   -1.434019e-02
## hs_mn_c_Log2                    .           
## hs_mo_c_Log2                   -8.139669e-02
## hs_pb_c_Log2                   -2.612069e-03
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -2.914806e-02
## hs_ddt_cadj_Log2                .           
## hs_hcb_cadj_Log2                .           
## hs_pcb118_cadj_Log2             .           
## hs_pcb138_cadj_Log2             .           
## hs_pcb153_cadj_Log2            -2.722245e-01
## hs_pcb170_cadj_Log2            -5.353440e-02
## hs_pcb180_cadj_Log2             .           
## hs_dep_cadj_Log2               -1.500516e-02
## hs_detp_cadj_Log2               .           
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2                .           
## hs_dmtp_cadj_Log2               .           
## hs_pbde153_cadj_Log2           -3.347976e-02
## hs_pbde47_cadj_Log2             .           
## hs_pfhxs_c_Log2                 .           
## hs_pfna_c_Log2                  .           
## hs_pfoa_c_Log2                 -1.269209e-01
## hs_pfos_c_Log2                  .           
## hs_pfunda_c_Log2                .           
## hs_bpa_cadj_Log2                .           
## hs_bupa_cadj_Log2               .           
## hs_etpa_cadj_Log2               .           
## hs_mepa_cadj_Log2              -1.611062e-03
## hs_oxbe_cadj_Log2               .           
## hs_prpa_cadj_Log2               .           
## hs_trcs_cadj_Log2               .           
## hs_mbzp_cadj_Log2               3.333918e-02
## hs_mecpp_cadj_Log2              .           
## hs_mehhp_cadj_Log2              .           
## hs_mehp_cadj_Log2               .           
## hs_meohp_cadj_Log2              .           
## hs_mep_cadj_Log2                .           
## hs_mibp_cadj_Log2              -1.952485e-02
## hs_mnbp_cadj_Log2               .           
## hs_ohminp_cadj_Log2             .           
## hs_oxominp_cadj_Log2            .           
## FAS_cat_None                    .           
## hs_contactfam_3cat_num_None     .           
## hs_hm_pers_None                 .           
## hs_participation_3cat_None      .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .           
## hs_smk_parents_None             .           
## e3_sex_Nonefemale              -1.145910e-01
## e3_sex_Nonemale                 4.837453e-16
## e3_yearbir_None2004            -8.130929e-02
## e3_yearbir_None2005             .           
## e3_yearbir_None2006             .           
## e3_yearbir_None2007             .           
## e3_yearbir_None2008             .           
## e3_yearbir_None2009             .           
## h_edumc_None2                   .           
## h_edumc_None3                   .           
## h_cohort2                      -4.177208e-02
## h_cohort3                       3.549906e-01
## h_cohort4                       2.054442e-01
## h_cohort5                       .           
## h_cohort6                       .           
## hs_child_age_None               .
best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 97 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -5.2935057038
## h_bfdur_Ter(0,10.8]            -0.1339940396
## h_bfdur_Ter(10.8,34.9]          .           
## h_bfdur_Ter(34.9,Inf]           0.0089494335
## hs_bakery_prod_Ter(2,6]         .           
## hs_bakery_prod_Ter(6,Inf]      -0.2141485632
## hs_beverages_Ter(0.132,1]       .           
## hs_beverages_Ter(1,Inf]         .           
## hs_break_cer_Ter(1.1,5.5]       .           
## hs_break_cer_Ter(5.5,Inf]       .           
## hs_caff_drink_Ter(0.132,Inf]    .           
## hs_dairy_Ter(14.6,25.6]         0.0067011171
## hs_dairy_Ter(25.6,Inf]          .           
## hs_fastfood_Ter(0.132,0.5]      0.0753896525
## hs_fastfood_Ter(0.5,Inf]        .           
## h_legume_preg_Ter(0.5,2]        .           
## h_legume_preg_Ter(2,Inf]       -0.0976436998
## hs_org_food_Ter(0.132,1]        .           
## hs_org_food_Ter(1,Inf]          .           
## hs_proc_meat_Ter(1.5,4]         .           
## hs_proc_meat_Ter(4,Inf]         .           
## hs_readymade_Ter(0.132,0.5]     .           
## hs_readymade_Ter(0.5,Inf]       0.0093784806
## hs_total_bread_Ter(7,17.5]     -0.0133317671
## hs_total_bread_Ter(17.5,Inf]    .           
## hs_total_cereal_Ter(14.1,23.6]  .           
## hs_total_cereal_Ter(23.6,Inf]   .           
## hs_total_fish_Ter(1.5,3]       -0.0293188976
## hs_total_fish_Ter(3,Inf]        .           
## hs_total_fruits_Ter(7,14.1]     .           
## hs_total_fruits_Ter(14.1,Inf]  -0.0224728007
## hs_total_lipids_Ter(3,7]        .           
## hs_total_lipids_Ter(7,Inf]     -0.0481926073
## hs_total_meat_Ter(6,9]          .           
## hs_total_meat_Ter(9,Inf]        .           
## hs_total_potatoes_Ter(3,4]      0.0164385738
## hs_total_potatoes_Ter(4,Inf]    .           
## hs_total_sweets_Ter(4.1,8.5]   -0.0198498846
## hs_total_sweets_Ter(8.5,Inf]    .           
## hs_total_veg_Ter(6,8.5]         .           
## hs_total_veg_Ter(8.5,Inf]      -0.0427817570
## hs_total_yog_Ter(6,8.5]         .           
## hs_total_yog_Ter(8.5,Inf]       .           
## hs_as_c_Log2                    .           
## hs_cd_c_Log2                   -0.0265679349
## hs_co_c_Log2                   -0.0097399554
## hs_cs_c_Log2                    0.0689070580
## hs_cu_c_Log2                    0.6918303003
## hs_hg_c_Log2                   -0.0164166806
## hs_mn_c_Log2                    .           
## hs_mo_c_Log2                   -0.1028019657
## hs_pb_c_Log2                    .           
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0281622667
## hs_ddt_cadj_Log2                .           
## hs_hcb_cadj_Log2                .           
## hs_pcb118_cadj_Log2             .           
## hs_pcb138_cadj_Log2             .           
## hs_pcb153_cadj_Log2            -0.2416501185
## hs_pcb170_cadj_Log2            -0.0554100591
## hs_pcb180_cadj_Log2             .           
## hs_dep_cadj_Log2               -0.0195864300
## hs_detp_cadj_Log2               .           
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2                .           
## hs_dmtp_cadj_Log2               .           
## hs_pbde153_cadj_Log2           -0.0356346258
## hs_pbde47_cadj_Log2             .           
## hs_pfhxs_c_Log2                 .           
## hs_pfna_c_Log2                  .           
## hs_pfoa_c_Log2                 -0.1213922819
## hs_pfos_c_Log2                 -0.0523978803
## hs_pfunda_c_Log2                .           
## hs_bpa_cadj_Log2                .           
## hs_bupa_cadj_Log2               .           
## hs_etpa_cadj_Log2               .           
## hs_mepa_cadj_Log2               .           
## hs_oxbe_cadj_Log2               .           
## hs_prpa_cadj_Log2               0.0004520332
## hs_trcs_cadj_Log2               .           
## hs_mbzp_cadj_Log2               0.0467602173
## hs_mecpp_cadj_Log2              .           
## hs_mehhp_cadj_Log2              .           
## hs_mehp_cadj_Log2               .           
## hs_meohp_cadj_Log2              .           
## hs_mep_cadj_Log2                .           
## hs_mibp_cadj_Log2              -0.0313108890
## hs_mnbp_cadj_Log2              -0.0123108272
## hs_ohminp_cadj_Log2             .           
## hs_oxominp_cadj_Log2            .           
## FAS_cat_None                    .           
## hs_contactfam_3cat_num_None     .           
## hs_hm_pers_None                -0.0056381866
## hs_participation_3cat_None      .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .           
## hs_smk_parents_None             .
cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.173885
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.203556
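
The lasso coefficient listings above are long because every dropped predictor is printed as `.`. A compact view of only the selected predictors can be obtained by converting the sparse coefficient matrix to a data frame and filtering it. The sketch below uses simulated data, and the helper `nonzero_coefs` is a hypothetical name introduced for illustration.

```r
library(glmnet)

# Simulated data standing in for the real design matrix and outcome
set.seed(101)
n <- 150
p <- 12
x_demo <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y_demo <- 2 * x_demo[, "x1"] - 1.5 * x_demo[, "x2"] + rnorm(n)

fit_demo <- cv.glmnet(x_demo, y_demo, alpha = 1, family = "gaussian")

# Hypothetical helper: keep only predictors with non-zero coefficients,
# ordered by absolute size (the intercept is dropped)
nonzero_coefs <- function(cv_fit, s = "lambda.min") {
  beta <- as.matrix(coef(cv_fit, s = s))  # dense one-column matrix
  out <- data.frame(term = rownames(beta), estimate = beta[, 1], row.names = NULL)
  out <- out[out$estimate != 0 & out$term != "(Intercept)", ]
  out[order(-abs(out$estimate)), ]
}

selected_demo <- nonzero_coefs(fit_demo)
print(selected_demo)
```

Applied to the fits above, the call would presumably be `nonzero_coefs(fit_with_covariates)`. Because the predictors are on different scales, the ordering by coefficient size is a reading aid and not a ranking of importance.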

Predicting Ridge

# RIDGE
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 0, family = "gaussian")
predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

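# plot() on a cv.glmnet object draws the CV error curve; the coefficient
# paths come from the underlying glmnet fit stored in $glmnet.fit
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")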

best_lambda <- fit_with_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates, s = best_lambda)  # coefficients at the chosen lambda
## 113 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -3.5861814106
## h_bfdur_Ter(0,10.8]            -0.0779257475
## h_bfdur_Ter(10.8,34.9]          0.0008585306
## h_bfdur_Ter(34.9,Inf]           0.0763578871
## hs_bakery_prod_Ter(2,6]        -0.0232612180
## hs_bakery_prod_Ter(6,Inf]      -0.1594193125
## hs_beverages_Ter(0.132,1]       0.0028931119
## hs_beverages_Ter(1,Inf]        -0.0324969479
## hs_break_cer_Ter(1.1,5.5]       0.0038074159
## hs_break_cer_Ter(5.5,Inf]      -0.0550061557
## hs_caff_drink_Ter(0.132,Inf]    0.0203377311
## hs_dairy_Ter(14.6,25.6]         0.0400994878
## hs_dairy_Ter(25.6,Inf]         -0.0012457772
## hs_fastfood_Ter(0.132,0.5]      0.0601484486
## hs_fastfood_Ter(0.5,Inf]       -0.0272351131
## h_legume_preg_Ter(0.5,2]        0.0184597362
## h_legume_preg_Ter(2,Inf]       -0.0512443355
## hs_org_food_Ter(0.132,1]        0.0299325601
## hs_org_food_Ter(1,Inf]         -0.0448649605
## hs_proc_meat_Ter(1.5,4]         0.0006243976
## hs_proc_meat_Ter(4,Inf]        -0.0194999904
## hs_readymade_Ter(0.132,0.5]     0.0246995881
## hs_readymade_Ter(0.5,Inf]       0.0632237345
## hs_total_bread_Ter(7,17.5]     -0.0690927270
## hs_total_bread_Ter(17.5,Inf]    0.0165408959
## hs_total_cereal_Ter(14.1,23.6]  0.0029703870
## hs_total_cereal_Ter(23.6,Inf]   0.0273109910
## hs_total_fish_Ter(1.5,3]       -0.0498949858
## hs_total_fish_Ter(3,Inf]       -0.0090873302
## hs_total_fruits_Ter(7,14.1]     0.0336802077
## hs_total_fruits_Ter(14.1,Inf]  -0.0368074311
## hs_total_lipids_Ter(3,7]       -0.0018898328
## hs_total_lipids_Ter(7,Inf]     -0.0606809337
## hs_total_meat_Ter(6,9]          0.0124691415
## hs_total_meat_Ter(9,Inf]       -0.0071145855
## hs_total_potatoes_Ter(3,4]      0.0377787307
## hs_total_potatoes_Ter(4,Inf]   -0.0090633482
## hs_total_sweets_Ter(4.1,8.5]   -0.0671772450
## hs_total_sweets_Ter(8.5,Inf]    0.0004650383
## hs_total_veg_Ter(6,8.5]         0.0055347534
## hs_total_veg_Ter(8.5,Inf]      -0.0337870870
## hs_total_yog_Ter(6,8.5]        -0.0213665643
## hs_total_yog_Ter(8.5,Inf]      -0.0303531094
## hs_as_c_Log2                    0.0040435851
## hs_cd_c_Log2                   -0.0324347369
## hs_co_c_Log2                   -0.0425353065
## hs_cs_c_Log2                    0.1221782900
## hs_cu_c_Log2                    0.5402150025
## hs_hg_c_Log2                   -0.0229618178
## hs_mn_c_Log2                    0.0050341797
## hs_mo_c_Log2                   -0.0847200038
## hs_pb_c_Log2                   -0.0293800014
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0454641675
## hs_ddt_cadj_Log2                0.0044273954
## hs_hcb_cadj_Log2               -0.0540715945
## hs_pcb118_cadj_Log2             0.0079995568
## hs_pcb138_cadj_Log2            -0.0569704526
## hs_pcb153_cadj_Log2            -0.1358711722
## hs_pcb170_cadj_Log2            -0.0434809137
## hs_pcb180_cadj_Log2            -0.0256847965
## hs_dep_cadj_Log2               -0.0168233284
## hs_detp_cadj_Log2               0.0053640467
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2               -0.0017685990
## hs_dmtp_cadj_Log2              -0.0001932927
## hs_pbde153_cadj_Log2           -0.0272900235
## hs_pbde47_cadj_Log2             0.0089523624
## hs_pfhxs_c_Log2                -0.0141314890
## hs_pfna_c_Log2                 -0.0052906747
## hs_pfoa_c_Log2                 -0.1143166079
## hs_pfos_c_Log2                 -0.0297311540
## hs_pfunda_c_Log2                0.0099738360
## hs_bpa_cadj_Log2               -0.0065661658
## hs_bupa_cadj_Log2               0.0032086202
## hs_etpa_cadj_Log2              -0.0038006820
## hs_mepa_cadj_Log2              -0.0125845244
## hs_oxbe_cadj_Log2               0.0072125426
## hs_prpa_cadj_Log2               0.0053787515
## hs_trcs_cadj_Log2               0.0034326693
## hs_mbzp_cadj_Log2               0.0446334946
## hs_mecpp_cadj_Log2             -0.0045884088
## hs_mehhp_cadj_Log2              0.0017001779
## hs_mehp_cadj_Log2              -0.0042613382
## hs_meohp_cadj_Log2              0.0010516738
## hs_mep_cadj_Log2                0.0052513081
## hs_mibp_cadj_Log2              -0.0287179511
## hs_mnbp_cadj_Log2              -0.0274492557
## hs_ohminp_cadj_Log2            -0.0229716236
## hs_oxominp_cadj_Log2            0.0093025927
## FAS_cat_None                    .           
## hs_contactfam_3cat_num_None     .           
## hs_hm_pers_None                -0.0161682014
## hs_participation_3cat_None      .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .           
## hs_smk_parents_None             .           
## e3_sex_Nonefemale              -0.0792980597
## e3_sex_Nonemale                 0.0792766913
## e3_yearbir_None2004            -0.1104559510
## e3_yearbir_None2005             0.0544534231
## e3_yearbir_None2006            -0.0140080942
## e3_yearbir_None2007             0.0036495133
## e3_yearbir_None2008             0.0250943044
## e3_yearbir_None2009             0.0019301269
## h_edumc_None2                   0.0276842155
## h_edumc_None3                   0.0264802460
## h_cohort2                      -0.1038841854
## h_cohort3                       0.2164127464
## h_cohort4                       0.2072617618
## h_cohort5                      -0.0489596800
## h_cohort6                       0.0728739983
## hs_child_age_None              -0.0096555759
best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 97 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -3.7182292825
## h_bfdur_Ter(0,10.8]            -0.0860797524
## h_bfdur_Ter(10.8,34.9]          0.0157690142
## h_bfdur_Ter(34.9,Inf]           0.0738891964
## hs_bakery_prod_Ter(2,6]        -0.0004921922
## hs_bakery_prod_Ter(6,Inf]      -0.1550947799
## hs_beverages_Ter(0.132,1]       0.0056005297
## hs_beverages_Ter(1,Inf]        -0.0294498010
## hs_break_cer_Ter(1.1,5.5]       0.0041625456
## hs_break_cer_Ter(5.5,Inf]      -0.0496482050
## hs_caff_drink_Ter(0.132,Inf]    0.0132074628
## hs_dairy_Ter(14.6,25.6]         0.0383328785
## hs_dairy_Ter(25.6,Inf]         -0.0156932735
## hs_fastfood_Ter(0.132,0.5]      0.0656064842
## hs_fastfood_Ter(0.5,Inf]       -0.0279027967
## h_legume_preg_Ter(0.5,2]        0.0518026147
## h_legume_preg_Ter(2,Inf]       -0.0515681232
## hs_org_food_Ter(0.132,1]        0.0293115265
## hs_org_food_Ter(1,Inf]         -0.0475840779
## hs_proc_meat_Ter(1.5,4]         0.0053011903
## hs_proc_meat_Ter(4,Inf]        -0.0118136041
## hs_readymade_Ter(0.132,0.5]     0.0286721450
## hs_readymade_Ter(0.5,Inf]       0.0580188327
## hs_total_bread_Ter(7,17.5]     -0.0530077370
## hs_total_bread_Ter(17.5,Inf]    0.0120103801
## hs_total_cereal_Ter(14.1,23.6]  0.0012320991
## hs_total_cereal_Ter(23.6,Inf]   0.0156516499
## hs_total_fish_Ter(1.5,3]       -0.0666413614
## hs_total_fish_Ter(3,Inf]        0.0083079643
## hs_total_fruits_Ter(7,14.1]     0.0321951554
## hs_total_fruits_Ter(14.1,Inf]  -0.0418966990
## hs_total_lipids_Ter(3,7]       -0.0108587395
## hs_total_lipids_Ter(7,Inf]     -0.0779530877
## hs_total_meat_Ter(6,9]          0.0174718115
## hs_total_meat_Ter(9,Inf]        0.0060858362
## hs_total_potatoes_Ter(3,4]      0.0526652537
## hs_total_potatoes_Ter(4,Inf]   -0.0086984587
## hs_total_sweets_Ter(4.1,8.5]   -0.0702429607
## hs_total_sweets_Ter(8.5,Inf]   -0.0035306199
## hs_total_veg_Ter(6,8.5]         0.0041934127
## hs_total_veg_Ter(8.5,Inf]      -0.0551622414
## hs_total_yog_Ter(6,8.5]        -0.0205709488
## hs_total_yog_Ter(8.5,Inf]      -0.0350202977
## hs_as_c_Log2                    0.0046556915
## hs_cd_c_Log2                   -0.0345175260
## hs_co_c_Log2                   -0.0408949090
## hs_cs_c_Log2                    0.0841409453
## hs_cu_c_Log2                    0.5332936404
## hs_hg_c_Log2                   -0.0262284450
## hs_mn_c_Log2                   -0.0160626952
## hs_mo_c_Log2                   -0.0835014660
## hs_pb_c_Log2                   -0.0220059774
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0362905790
## hs_ddt_cadj_Log2                0.0037526990
## hs_hcb_cadj_Log2               -0.0310188009
## hs_pcb118_cadj_Log2             0.0073013802
## hs_pcb138_cadj_Log2            -0.0532103641
## hs_pcb153_cadj_Log2            -0.1242985374
## hs_pcb170_cadj_Log2            -0.0419218945
## hs_pcb180_cadj_Log2            -0.0242896915
## hs_dep_cadj_Log2               -0.0188114284
## hs_detp_cadj_Log2               0.0054139973
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2               -0.0025809251
## hs_dmtp_cadj_Log2               0.0009658477
## hs_pbde153_cadj_Log2           -0.0275122930
## hs_pbde47_cadj_Log2             0.0058804616
## hs_pfhxs_c_Log2                -0.0297015899
## hs_pfna_c_Log2                 -0.0065309195
## hs_pfoa_c_Log2                 -0.1074314041
## hs_pfos_c_Log2                 -0.0474644980
## hs_pfunda_c_Log2                0.0066413412
## hs_bpa_cadj_Log2               -0.0070499072
## hs_bupa_cadj_Log2               0.0037281508
## hs_etpa_cadj_Log2              -0.0047185955
## hs_mepa_cadj_Log2              -0.0097560575
## hs_oxbe_cadj_Log2               0.0091813336
## hs_prpa_cadj_Log2               0.0062288642
## hs_trcs_cadj_Log2               0.0059294563
## hs_mbzp_cadj_Log2               0.0412654953
## hs_mecpp_cadj_Log2              0.0062824833
## hs_mehhp_cadj_Log2              0.0125842196
## hs_mehp_cadj_Log2              -0.0053456689
## hs_meohp_cadj_Log2              0.0091034417
## hs_mep_cadj_Log2                0.0064257842
## hs_mibp_cadj_Log2              -0.0341283015
## hs_mnbp_cadj_Log2              -0.0329934970
## hs_ohminp_cadj_Log2            -0.0208420975
## hs_oxominp_cadj_Log2            0.0114099235
## FAS_cat_None                    .           
## hs_contactfam_3cat_num_None     .           
## hs_hm_pers_None                -0.0228004897
## hs_participation_3cat_None      .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .           
## hs_smk_parents_None             .
cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.123193
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.155814
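
The lasso, ridge, and elastic net blocks repeat the same fit, predict, and MSE steps with only `alpha` changing. One way to reduce that repetition is a small wrapper, sketched here on simulated data. The function `fit_and_score` and the `*_demo` objects are hypothetical and are not taken from the original analysis.

```r
library(glmnet)

# Hypothetical helper: fit cv.glmnet for one alpha and return the test MSE
# at lambda.min
fit_and_score <- function(x_train, y_train, x_test, y_test, alpha) {
  fit <- cv.glmnet(x_train, y_train, alpha = alpha, family = "gaussian")
  pred <- predict(fit, newx = x_test, s = "lambda.min")
  mean((y_test - as.numeric(pred))^2)
}

# Simulated data and a 70/30 split, mirroring the structure used above
set.seed(101)
n <- 300
p <- 15
x_demo <- matrix(rnorm(n * p), n, p)
y_demo <- x_demo[, 1] + 0.5 * x_demo[, 2] + rnorm(n)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)

alphas <- c(lasso = 1, ridge = 0, elastic_net = 0.5)
test_mse <- sapply(alphas, function(a) {
  fit_and_score(x_demo[train_idx, ], y_demo[train_idx],
                x_demo[test_idx, ], y_demo[test_idx], alpha = a)
})
print(round(test_mse, 3))
```

With the study data, the same call could be run once on `x_full_train` and once on `x_combined_train` to obtain the with-covariates and without-covariates results side by side.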

Predicting Elastic Net

# ELASTIC NET
fit_with_covariates <- cv.glmnet(x_full_train, y_train, alpha = 0.5, family = "gaussian")
predictions_with_covariates <- predict(fit_with_covariates, s = "lambda.min", newx = x_full_test)
mse_with_covariates <- mean((y_test - predictions_with_covariates)^2)

fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0.5, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

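# plot() on a cv.glmnet object draws the CV error curve; the coefficient
# paths come from the underlying glmnet fit stored in $glmnet.fit
plot(fit_with_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (With Covariates)")

plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")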

best_lambda <- fit_with_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_with_covariates, s = best_lambda)  # coefficients at the chosen lambda
## 113 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    -5.258333400
## h_bfdur_Ter(0,10.8]            -0.082442759
## h_bfdur_Ter(10.8,34.9]          .          
## h_bfdur_Ter(34.9,Inf]           0.079982963
## hs_bakery_prod_Ter(2,6]         .          
## hs_bakery_prod_Ter(6,Inf]      -0.201616093
## hs_beverages_Ter(0.132,1]       .          
## hs_beverages_Ter(1,Inf]         .          
## hs_break_cer_Ter(1.1,5.5]       .          
## hs_break_cer_Ter(5.5,Inf]       .          
## hs_caff_drink_Ter(0.132,Inf]    .          
## hs_dairy_Ter(14.6,25.6]         0.005203095
## hs_dairy_Ter(25.6,Inf]          .          
## hs_fastfood_Ter(0.132,0.5]      0.059880567
## hs_fastfood_Ter(0.5,Inf]        .          
## h_legume_preg_Ter(0.5,2]        .          
## h_legume_preg_Ter(2,Inf]       -0.056124414
## hs_org_food_Ter(0.132,1]        0.005083293
## hs_org_food_Ter(1,Inf]          .          
## hs_proc_meat_Ter(1.5,4]         .          
## hs_proc_meat_Ter(4,Inf]         .          
## hs_readymade_Ter(0.132,0.5]     .          
## hs_readymade_Ter(0.5,Inf]       0.015984706
## hs_total_bread_Ter(7,17.5]     -0.050862711
## hs_total_bread_Ter(17.5,Inf]    .          
## hs_total_cereal_Ter(14.1,23.6]  .          
## hs_total_cereal_Ter(23.6,Inf]   .          
## hs_total_fish_Ter(1.5,3]        .          
## hs_total_fish_Ter(3,Inf]        .          
## hs_total_fruits_Ter(7,14.1]     0.002674839
## hs_total_fruits_Ter(14.1,Inf]  -0.014334189
## hs_total_lipids_Ter(3,7]        .          
## hs_total_lipids_Ter(7,Inf]     -0.026344671
## hs_total_meat_Ter(6,9]          .          
## hs_total_meat_Ter(9,Inf]        .          
## hs_total_potatoes_Ter(3,4]      0.009514919
## hs_total_potatoes_Ter(4,Inf]    .          
## hs_total_sweets_Ter(4.1,8.5]   -0.028662773
## hs_total_sweets_Ter(8.5,Inf]    .          
## hs_total_veg_Ter(6,8.5]         .          
## hs_total_veg_Ter(8.5,Inf]      -0.004184539
## hs_total_yog_Ter(6,8.5]         .          
## hs_total_yog_Ter(8.5,Inf]       .          
## hs_as_c_Log2                    .          
## hs_cd_c_Log2                   -0.016093261
## hs_co_c_Log2                   -0.003072533
## hs_cs_c_Log2                    0.146384847
## hs_cu_c_Log2                    0.695195647
## hs_hg_c_Log2                   -0.021688344
## hs_mn_c_Log2                    .          
## hs_mo_c_Log2                   -0.093103311
## hs_pb_c_Log2                   -0.022060040
## hs_tl_cdich_None                .          
## hs_dde_cadj_Log2               -0.043119699
## hs_ddt_cadj_Log2                .          
## hs_hcb_cadj_Log2               -0.020701577
## hs_pcb118_cadj_Log2             .          
## hs_pcb138_cadj_Log2             .          
## hs_pcb153_cadj_Log2            -0.279263147
## hs_pcb170_cadj_Log2            -0.056595859
## hs_pcb180_cadj_Log2             .          
## hs_dep_cadj_Log2               -0.016828799
## hs_detp_cadj_Log2               .          
## hs_dmdtp_cdich_None             .          
## hs_dmp_cadj_Log2                .          
## hs_dmtp_cadj_Log2               .          
## hs_pbde153_cadj_Log2           -0.032258855
## hs_pbde47_cadj_Log2             .          
## hs_pfhxs_c_Log2                 .          
## hs_pfna_c_Log2                  .          
## hs_pfoa_c_Log2                 -0.129555577
## hs_pfos_c_Log2                  .          
## hs_pfunda_c_Log2                .          
## hs_bpa_cadj_Log2                .          
## hs_bupa_cadj_Log2               .          
## hs_etpa_cadj_Log2               .          
## hs_mepa_cadj_Log2              -0.006719327
## hs_oxbe_cadj_Log2               .          
## hs_prpa_cadj_Log2               .          
## hs_trcs_cadj_Log2               .          
## hs_mbzp_cadj_Log2               0.046066744
## hs_mecpp_cadj_Log2              .          
## hs_mehhp_cadj_Log2              .          
## hs_mehp_cadj_Log2               .          
## hs_meohp_cadj_Log2              .          
## hs_mep_cadj_Log2                .          
## hs_mibp_cadj_Log2              -0.028742762
## hs_mnbp_cadj_Log2              -0.002308112
## hs_ohminp_cadj_Log2            -0.002398690
## hs_oxominp_cadj_Log2            .          
## FAS_cat_None                    .          
## hs_contactfam_3cat_num_None     .          
## hs_hm_pers_None                 .          
## hs_participation_3cat_None      .          
## hs_cotinine_cdich_None          .          
## hs_globalexp2_None              .          
## hs_smk_parents_None             .          
## e3_sex_Nonefemale              -0.073233783
## e3_sex_Nonemale                 0.065324705
## e3_yearbir_None2004            -0.095696031
## e3_yearbir_None2005             .          
## e3_yearbir_None2006             .          
## e3_yearbir_None2007             .          
## e3_yearbir_None2008             .          
## e3_yearbir_None2009             .          
## h_edumc_None2                   .          
## h_edumc_None3                   .          
## h_cohort2                      -0.074809073
## h_cohort3                       0.391964861
## h_cohort4                       0.310207397
## h_cohort5                       .          
## h_cohort6                       0.073899674
## hs_child_age_None               .
best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 97 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                    -5.1133010038
## h_bfdur_Ter(0,10.8]            -0.1257812987
## h_bfdur_Ter(10.8,34.9]          .           
## h_bfdur_Ter(34.9,Inf]           0.0098695900
## hs_bakery_prod_Ter(2,6]         .           
## hs_bakery_prod_Ter(6,Inf]      -0.2052707951
## hs_beverages_Ter(0.132,1]       .           
## hs_beverages_Ter(1,Inf]         .           
## hs_break_cer_Ter(1.1,5.5]       .           
## hs_break_cer_Ter(5.5,Inf]       .           
## hs_caff_drink_Ter(0.132,Inf]    .           
## hs_dairy_Ter(14.6,25.6]         0.0083014798
## hs_dairy_Ter(25.6,Inf]          .           
## hs_fastfood_Ter(0.132,0.5]      0.0723166735
## hs_fastfood_Ter(0.5,Inf]        .           
## h_legume_preg_Ter(0.5,2]        0.0101366309
## h_legume_preg_Ter(2,Inf]       -0.0853205645
## hs_org_food_Ter(0.132,1]        .           
## hs_org_food_Ter(1,Inf]         -0.0004524184
## hs_proc_meat_Ter(1.5,4]         .           
## hs_proc_meat_Ter(4,Inf]         .           
## hs_readymade_Ter(0.132,0.5]     .           
## hs_readymade_Ter(0.5,Inf]       0.0083640118
## hs_total_bread_Ter(7,17.5]     -0.0123257688
## hs_total_bread_Ter(17.5,Inf]    .           
## hs_total_cereal_Ter(14.1,23.6]  .           
## hs_total_cereal_Ter(23.6,Inf]   .           
## hs_total_fish_Ter(1.5,3]       -0.0285577491
## hs_total_fish_Ter(3,Inf]        .           
## hs_total_fruits_Ter(7,14.1]     .           
## hs_total_fruits_Ter(14.1,Inf]  -0.0228116118
## hs_total_lipids_Ter(3,7]        .           
## hs_total_lipids_Ter(7,Inf]     -0.0485418209
## hs_total_meat_Ter(6,9]          .           
## hs_total_meat_Ter(9,Inf]        .           
## hs_total_potatoes_Ter(3,4]      0.0168351467
## hs_total_potatoes_Ter(4,Inf]    .           
## hs_total_sweets_Ter(4.1,8.5]   -0.0198599950
## hs_total_sweets_Ter(8.5,Inf]    .           
## hs_total_veg_Ter(6,8.5]         .           
## hs_total_veg_Ter(8.5,Inf]      -0.0429851064
## hs_total_yog_Ter(6,8.5]         .           
## hs_total_yog_Ter(8.5,Inf]       .           
## hs_as_c_Log2                    .           
## hs_cd_c_Log2                   -0.0258899345
## hs_co_c_Log2                   -0.0092126979
## hs_cs_c_Log2                    0.0676393049
## hs_cu_c_Log2                    0.6693489253
## hs_hg_c_Log2                   -0.0152313991
## hs_mn_c_Log2                    .           
## hs_mo_c_Log2                   -0.0992111766
## hs_pb_c_Log2                    .           
## hs_tl_cdich_None                .           
## hs_dde_cadj_Log2               -0.0299730899
## hs_ddt_cadj_Log2                .           
## hs_hcb_cadj_Log2                .           
## hs_pcb118_cadj_Log2             .           
## hs_pcb138_cadj_Log2             .           
## hs_pcb153_cadj_Log2            -0.2311607073
## hs_pcb170_cadj_Log2            -0.0546601693
## hs_pcb180_cadj_Log2             .           
## hs_dep_cadj_Log2               -0.0188961844
## hs_detp_cadj_Log2               .           
## hs_dmdtp_cdich_None             .           
## hs_dmp_cadj_Log2                .           
## hs_dmtp_cadj_Log2               .           
## hs_pbde153_cadj_Log2           -0.0351087073
## hs_pbde47_cadj_Log2             .           
## hs_pfhxs_c_Log2                -0.0017727290
## hs_pfna_c_Log2                  .           
## hs_pfoa_c_Log2                 -0.1215693401
## hs_pfos_c_Log2                 -0.0511509209
## hs_pfunda_c_Log2                .           
## hs_bpa_cadj_Log2                .           
## hs_bupa_cadj_Log2               .           
## hs_etpa_cadj_Log2               .           
## hs_mepa_cadj_Log2               .           
## hs_oxbe_cadj_Log2               .           
## hs_prpa_cadj_Log2               0.0003657529
## hs_trcs_cadj_Log2               .           
## hs_mbzp_cadj_Log2               0.0440420234
## hs_mecpp_cadj_Log2              .           
## hs_mehhp_cadj_Log2              .           
## hs_mehp_cadj_Log2               .           
## hs_meohp_cadj_Log2              .           
## hs_mep_cadj_Log2                .           
## hs_mibp_cadj_Log2              -0.0300733513
## hs_mnbp_cadj_Log2              -0.0122025518
## hs_ohminp_cadj_Log2             .           
## hs_oxominp_cadj_Log2            .           
## FAS_cat_None                    .           
## hs_contactfam_3cat_num_None     .           
## hs_hm_pers_None                -0.0055372885
## hs_participation_3cat_None      .           
## hs_cotinine_cdich_None          .           
## hs_globalexp2_None              .           
## hs_smk_parents_None             .
cat("Model with Covariates - Test MSE:", mse_with_covariates, "\n")
## Model with Covariates - Test MSE: 1.155518
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.200805
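
The sparse coefficient printouts above are long, and most entries are shrunk to zero. As a minimal sketch (on simulated data, not the HELIX data; the `exposure_*` names and the objects `x`, `y`, `fit`, and `nonzero` are hypothetical), the nonzero elastic net coefficients could be pulled into a short table sorted by absolute size:

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for x_full_train / y_train (assumption: not the study data)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("exposure_", 1:p)))
y <- 0.8 * x[, 1] - 0.5 * x[, 2] + rnorm(n)

fit <- cv.glmnet(x, y, alpha = 0.5, family = "gaussian")

# coef() returns a one-column sparse matrix; convert it to a dense matrix
b <- as.matrix(coef(fit, s = "lambda.min"))
nonzero <- data.frame(term = rownames(b), estimate = b[, 1], row.names = NULL)

# keep selected predictors only (drop zeros and the intercept), largest first
nonzero <- nonzero[nonzero$estimate != 0 & nonzero$term != "(Intercept)", ]
nonzero <- nonzero[order(-abs(nonzero$estimate)), ]
print(nonzero)
```

Applied to `fit_with_covariates`, the same steps would list only the terms that survive the penalty, which is easier to scan than the full 113-row printout.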

Group Lasso

num_chemical_diet <- ncol(x_chemical_train) + ncol(x_diet_train)
num_covariates <- ncol(x_covariates_train)

# make sure these add up to the number of columns in x_full_train
if((num_chemical_diet + num_covariates) != ncol(x_full_train)) {
  cat("Mismatch in expected column counts\n")
}

# define groups
groups <- c(rep(1, num_chemical_diet), rep(2, num_covariates))

# make sure all columns are numeric
x_full_train <- data.frame(x_full_train)
x_full_train[] <- lapply(x_full_train, function(x) {
  if(is.factor(x) || is.character(x)) {
    as.numeric(as.factor(x))
  } else {
    x
  }
})

# verify that every column is now numeric
if(any(sapply(x_full_train, function(x) !is.numeric(x)))) {
  stop("Some columns are still not numeric.")
}

if(length(groups) != ncol(x_full_train)) {
  stop("Group vector length still does not match the number of predictors.")
} else {
  cat("Group vector length now matches the number of predictors.\n")
}
## Group vector length now matches the number of predictors.
model_group_lasso <- grpreg(x_full_train, y_train, group = groups, penalty = "grLasso")

# x_full_test as numeric
x_full_test <- data.frame(x_full_test)
x_full_test[] <- lapply(x_full_test, function(x) {
  if(is.factor(x) || is.character(x)) {
    as.numeric(as.factor(x))
  } else {
    x
  }
})

if(any(sapply(x_full_test, function(x) !is.numeric(x)))) {
  stop("Some columns in x_full_test are still not numeric.")
}

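x_full_test_matrix <- as.matrix(x_full_test)
# no lambda is supplied, so predict() returns one column of predictions
# per lambda on the fitted path (an n x n_lambda matrix)
predictions_group_lasso <- predict(model_group_lasso, x_full_test_matrix, type = "response")

actuals <- y_test
predicted <- predictions_group_lasso
# y_test is recycled down each column, so this MSE is averaged over the whole lambda path
mse <- mean((actuals - predicted)^2)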
rmse <- sqrt(mse)
cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 1.447646
cat("Root Mean Squared Error:", rmse, "\n")
## Root Mean Squared Error: 1.203181
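
Because `predict()` on a `grpreg` fit returns predictions for every lambda on the path when no `lambda` is given, the MSE above mixes well-tuned and poorly tuned fits. A possible alternative, sketched here on simulated data (the objects `X`, `y`, `groups`, `train`, and `test` are hypothetical stand-ins), is to choose lambda by cross-validation with `cv.grpreg()` and score the test set at that single value:

```r
library(grpreg)

set.seed(1)
# Simulated stand-in for the design matrix and outcome (assumption)
n <- 300; p <- 12
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("v", 1:p)
y <- X[, 1] - X[, 2] + 0.5 * X[, 9] + rnorm(n)
groups <- c(rep(1, 8), rep(2, 4))   # two groups, mirroring the exposure/covariate split

train <- 1:200
test <- 201:300

# 10-fold CV (the default) over the group lasso path
cv_fit <- cv.grpreg(X[train, ], y[train], group = groups, penalty = "grLasso")

# predict.cv.grpreg() defaults to the CV-selected lambda (lambda.min)
pred <- predict(cv_fit, X[test, ], type = "response")
mse_cv <- mean((y[test] - pred)^2)

# for comparison: predicting from the full path gives one column per lambda
path_pred <- predict(cv_fit$fit, X[test, ], type = "response")

cat("Selected lambda:", cv_fit$lambda.min, "\n")
cat("Test MSE at selected lambda:", mse_cv, "\n")
cat("Columns in full-path prediction:", ncol(path_pred), "\n")
```

With only two groups, the penalty can do little more than keep or drop each block as a whole; finer groups (for example one per chemical family) would be a separate design choice.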

Clustering

Open question from last week's discussion: cluster individuals so that each cluster can be characterized as having high, medium, or low exposure.

set.seed(101)
x_scaled <- scale(x_full_train)  #scale the training data
wss <- sapply(1:15, function(k) {
  kmeans(x_scaled, centers = k, nstart = 20)$tot.withinss
})

plot(1:15, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares")

# k-means clustering with the number of clusters chosen at the elbow:
# the largest second difference of wss marks the sharpest bend in the curve
k <- which.max(diff(diff(wss))) + 1
km <- kmeans(x_scaled, centers = k, nstart = 25)

# plot cluster assignment
clusplot(x_scaled, km$cluster, color=TRUE, shade=TRUE, 
         labels=2, lines=0, main=paste("K-means Clustering with", k, "clusters"))
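
One possible way to obtain the high, medium, and low exposure characterization mentioned above is to fix three clusters and order them by their average standardized exposure. The sketch below uses simulated data; the `chem_*` columns and the objects `expo`, `km3`, and `exposure_tier` are hypothetical, and averaging all exposures into one score is an assumption that may not suit exposures that move in opposite directions.

```r
set.seed(101)
# Simulated stand-in for the continuous Log2 exposure columns (assumption)
n <- 150
level <- rep(c(-1.5, 0, 1.5), each = n / 3)   # latent exposure level per child
expo <- sapply(1:6, function(j) level + rnorm(n, sd = 0.5))
colnames(expo) <- paste0("chem_", 1:6)

z <- scale(expo)                               # standardize each exposure
km3 <- kmeans(z, centers = 3, nstart = 25)

# mean standardized exposure within each cluster, used to order the clusters
cluster_mean <- tapply(rowMeans(z), km3$cluster, mean)

# map cluster ids to tiers: lowest mean -> "low", highest mean -> "high"
tier_labels <- c("low", "medium", "high")
tier_of_cluster <- setNames(tier_labels[rank(cluster_mean)], names(cluster_mean))
exposure_tier <- factor(tier_of_cluster[as.character(km3$cluster)],
                        levels = tier_labels)

table(exposure_tier)
```

The resulting factor could then be cross-tabulated against BMI Z-score categories or used as a predictor in later models.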

Random Forest

x_full_train[] <- lapply(x_full_train, function(x) if(is.character(x)) factor(x) else x)
x_full_test[] <- lapply(x_full_test, function(x) if(is.character(x)) factor(x) else x)

rf_model <- randomForest(x_full_train, y_train, ntree=500, importance=TRUE)

importance(rf_model)
##                                    %IncMSE IncNodePurity
## h_bfdur_Ter.0.10.8.             1.62394425     2.6472740
## h_bfdur_Ter.10.8.34.9.          1.44728015     2.3187936
## h_bfdur_Ter.34.9.Inf.           0.42661331     2.1787180
## hs_bakery_prod_Ter.2.6.         1.42322939     2.2183957
## hs_bakery_prod_Ter.6.Inf.       3.66856059     3.6033007
## hs_beverages_Ter.0.132.1.       0.30441134     1.5262195
## hs_beverages_Ter.1.Inf.        -0.05648233     1.9449951
## hs_break_cer_Ter.1.1.5.5.      -0.49893844     1.4411459
## hs_break_cer_Ter.5.5.Inf.       1.58834132     3.6238282
## hs_caff_drink_Ter.0.132.Inf.    2.88031712     1.5850729
## hs_dairy_Ter.14.6.25.6.        -0.94873273     1.7573655
## hs_dairy_Ter.25.6.Inf.         -0.11199066     1.4038995
## hs_fastfood_Ter.0.132.0.5.     -0.73851208     2.4200607
## hs_fastfood_Ter.0.5.Inf.       -1.08793318     1.8398958
## h_legume_preg_Ter.0.5.2.        1.02913452     1.5388965
## h_legume_preg_Ter.2.Inf.        2.69014804     2.5865749
## hs_org_food_Ter.0.132.1.        1.36221797     1.8403823
## hs_org_food_Ter.1.Inf.         -0.46573537     1.7616331
## hs_proc_meat_Ter.1.5.4.        -0.45678186     1.7658075
## hs_proc_meat_Ter.4.Inf.        -0.88430858     1.1803357
## hs_readymade_Ter.0.132.0.5.    -0.12606145     1.6163844
## hs_readymade_Ter.0.5.Inf.      -0.45608826     1.9371251
## hs_total_bread_Ter.7.17.5.      0.15442273     1.5456688
## hs_total_bread_Ter.17.5.Inf.    0.10735063     2.1585239
## hs_total_cereal_Ter.14.1.23.6. -0.66474298     1.3968275
## hs_total_cereal_Ter.23.6.Inf.  -0.33065996     1.5798432
## hs_total_fish_Ter.1.5.3.        0.89901191     1.7980006
## hs_total_fish_Ter.3.Inf.        0.03453010     1.4685165
## hs_total_fruits_Ter.7.14.1.    -1.01236808     2.2180663
## hs_total_fruits_Ter.14.1.Inf.  -1.17879943     1.8142755
## hs_total_lipids_Ter.3.7.        0.99576403     1.6845186
## hs_total_lipids_Ter.7.Inf.      0.90327761     1.8433470
## hs_total_meat_Ter.6.9.         -0.39508369     1.2298493
## hs_total_meat_Ter.9.Inf.       -1.14930812     1.4839761
## hs_total_potatoes_Ter.3.4.     -0.02879509     1.9667421
## hs_total_potatoes_Ter.4.Inf.    0.69057433     1.9766205
## hs_total_sweets_Ter.4.1.8.5.    0.83164986     1.9166902
## hs_total_sweets_Ter.8.5.Inf.    0.05657777     1.4126819
## hs_total_veg_Ter.6.8.5.        -0.91608707     1.4140262
## hs_total_veg_Ter.8.5.Inf.      -0.10598934     2.3876899
## hs_total_yog_Ter.6.8.5.         1.13542005     1.4888589
## hs_total_yog_Ter.8.5.Inf.       1.10424924     0.7261897
## hs_as_c_Log2                    1.64815691    16.6466486
## hs_cd_c_Log2                    2.31257107    22.1929226
## hs_co_c_Log2                    0.41498301    18.4353751
## hs_cs_c_Log2                    1.17594185    18.8005754
## hs_cu_c_Log2                    4.18415526    32.5438363
## hs_hg_c_Log2                    2.73198756    20.6874755
## hs_mn_c_Log2                    0.66078482    17.2305211
## hs_mo_c_Log2                    1.85743031    25.5068009
## hs_pb_c_Log2                    1.02771155    17.8076216
## hs_tl_cdich_None                0.93838727     1.3112074
## hs_dde_cadj_Log2                9.38861168    30.7654268
## hs_ddt_cadj_Log2                4.63649126    25.9097505
## hs_hcb_cadj_Log2               15.53697838    71.9377305
## hs_pcb118_cadj_Log2             5.09181167    24.8770565
## hs_pcb138_cadj_Log2            11.27674444    46.5834800
## hs_pcb153_cadj_Log2            11.15837325    46.9819668
## hs_pcb170_cadj_Log2            12.85250894    69.7921122
## hs_pcb180_cadj_Log2             8.57654050    30.4415970
## hs_dep_cadj_Log2                0.10535720    19.5425789
## hs_detp_cadj_Log2               0.27224855    22.2342239
## hs_dmdtp_cdich_None            -0.18119932     1.3452171
## hs_dmp_cadj_Log2                1.00411247    19.4120449
## hs_dmtp_cadj_Log2              -0.09412237    18.6242483
## hs_pbde153_cadj_Log2            8.61230311    57.2199924
## hs_pbde47_cadj_Log2            -0.49914633    19.0883815
## hs_pfhxs_c_Log2                 1.19268773    18.8772269
## hs_pfna_c_Log2                  5.31504852    21.6291066
## hs_pfoa_c_Log2                  1.70950496    27.4875166
## hs_pfos_c_Log2                  4.97902014    28.7867780
## hs_pfunda_c_Log2                0.31877488    17.4486484
## hs_bpa_cadj_Log2                0.28300198    16.1689031
## hs_bupa_cadj_Log2              -0.90177003    21.4241574
## hs_etpa_cadj_Log2               0.12822626    17.8602032
## hs_mepa_cadj_Log2              -0.51241444    16.8508187
## hs_oxbe_cadj_Log2               1.20889453    21.7818671
## hs_prpa_cadj_Log2              -0.21090768    15.7609955
## hs_trcs_cadj_Log2               3.16260000    17.8193371
## hs_mbzp_cadj_Log2               0.58725651    21.1306131
## hs_mecpp_cadj_Log2              1.59100813    13.8922651
## hs_mehhp_cadj_Log2              2.32575192    14.0681085
## hs_mehp_cadj_Log2               0.32123863    14.8664145
## hs_meohp_cadj_Log2              1.34347705    12.8261532
## hs_mep_cadj_Log2                1.46043932    15.4653443
## hs_mibp_cadj_Log2               0.27877928    16.0575165
## hs_mnbp_cadj_Log2               0.33600919    19.2808726
## hs_ohminp_cadj_Log2             5.05913120    23.5103948
## hs_oxominp_cadj_Log2           -0.46482796    17.1762792
## FAS_cat_None                    0.21983082     2.5953622
## hs_contactfam_3cat_num_None    -0.74059254     2.6902674
## hs_hm_pers_None                -0.70098893     4.9503056
## hs_participation_3cat_None      1.12391217     4.9868047
## hs_cotinine_cdich_None          1.96479354     5.2452927
## hs_globalexp2_None              1.92498116     2.3745794
## hs_smk_parents_None             0.66801139     5.1112738
## e3_sex_Nonefemale               1.76864622     2.1742532
## e3_sex_Nonemale                 0.95394065     1.6816199
## e3_yearbir_None2004             0.44368971     1.1798504
## e3_yearbir_None2005             1.52565863     1.4685544
## e3_yearbir_None2006             1.27113726     0.8881484
## e3_yearbir_None2007            -0.01228062     1.0683998
## e3_yearbir_None2008            -0.42831674     1.3756556
## e3_yearbir_None2009             0.50666864     0.6236582
## h_edumc_None2                  -0.05935642     2.0579226
## h_edumc_None3                   0.37341740     1.5026138
## h_cohort2                       1.97963049     1.3363036
## h_cohort3                       7.92754915    14.7384558
## h_cohort4                       5.37343930     4.2456046
## h_cohort5                       3.01847986     0.4778185
## h_cohort6                       2.26940445     2.5738336
## hs_child_age_None               6.07973673    22.4898732
varImpPlot(rf_model)

# predict on the test set
predictions_rf <- predict(rf_model, x_full_test)

mse_rf <- mean((y_test - predictions_rf)^2)
cat("Random Forest Test MSE:", mse_rf, "\n")
## Random Forest Test MSE: 1.194692
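
The importance table above is printed in column order, which makes the strongest predictors hard to spot. A minimal sketch for ranking variables by permutation importance (`%IncMSE`) follows; it runs on simulated data, and the `exposure_*` names and the objects `x`, `y`, `rf`, and `imp_sorted` are hypothetical.

```r
library(randomForest)

set.seed(1)
# Simulated stand-in for x_full_train / y_train (assumption)
n <- 200; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("exposure_", 1:p)
y <- 1.5 * x$exposure_1 + rnorm(n)

rf <- randomForest(x, y, ntree = 200, importance = TRUE)

# type = 1 selects the permutation-based measure (%IncMSE)
imp <- importance(rf, type = 1)

# sort rows from most to least important and show the top five
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
head(imp_sorted, 5)
```

Applied to `rf_model`, this ordering would bring the organochlorine compounds with the largest `%IncMSE` values in the table (for example `hs_hcb_cadj_Log2` and `hs_pcb170_cadj_Log2`) to the top.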

GBM

gbm_model <- gbm(y_train ~ ., data = x_full_train,
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3,
                 n.minobsinnode = 10,
                 shrinkage = 0.01,
                 cv.folds = 5,
                 verbose = TRUE)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.4346             nan     0.0100    0.0028
##      2        1.4305             nan     0.0100    0.0029
##      3        1.4259             nan     0.0100    0.0036
##      4        1.4218             nan     0.0100    0.0036
##      5        1.4183             nan     0.0100    0.0024
##      6        1.4145             nan     0.0100    0.0030
##      7        1.4099             nan     0.0100    0.0035
##      8        1.4057             nan     0.0100    0.0030
##      9        1.4022             nan     0.0100    0.0027
##     10        1.3986             nan     0.0100    0.0028
##     20        1.3637             nan     0.0100    0.0024
##     40        1.3041             nan     0.0100    0.0004
##     60        1.2531             nan     0.0100    0.0008
##     80        1.2104             nan     0.0100    0.0010
##    100        1.1750             nan     0.0100    0.0006
##    120        1.1460             nan     0.0100    0.0002
##    140        1.1186             nan     0.0100    0.0006
##    160        1.0944             nan     0.0100    0.0006
##    180        1.0722             nan     0.0100    0.0003
##    200        1.0516             nan     0.0100   -0.0003
##    220        1.0319             nan     0.0100    0.0004
##    240        1.0141             nan     0.0100   -0.0002
##    260        0.9978             nan     0.0100   -0.0002
##    280        0.9806             nan     0.0100   -0.0001
##    300        0.9661             nan     0.0100   -0.0001
##    320        0.9524             nan     0.0100   -0.0001
##    340        0.9381             nan     0.0100   -0.0001
##    360        0.9257             nan     0.0100   -0.0001
##    380        0.9130             nan     0.0100   -0.0001
##    400        0.9009             nan     0.0100   -0.0001
##    420        0.8893             nan     0.0100   -0.0005
##    440        0.8781             nan     0.0100   -0.0003
##    460        0.8674             nan     0.0100   -0.0006
##    480        0.8562             nan     0.0100   -0.0003
##    500        0.8460             nan     0.0100   -0.0003
##    520        0.8351             nan     0.0100    0.0000
##    540        0.8247             nan     0.0100   -0.0000
##    560        0.8150             nan     0.0100   -0.0003
##    580        0.8052             nan     0.0100   -0.0002
##    600        0.7954             nan     0.0100   -0.0001
##    620        0.7864             nan     0.0100   -0.0001
##    640        0.7778             nan     0.0100   -0.0004
##    660        0.7692             nan     0.0100   -0.0007
##    680        0.7612             nan     0.0100   -0.0004
##    700        0.7528             nan     0.0100   -0.0003
##    720        0.7446             nan     0.0100   -0.0001
##    740        0.7368             nan     0.0100   -0.0002
##    760        0.7292             nan     0.0100   -0.0002
##    780        0.7213             nan     0.0100   -0.0005
##    800        0.7138             nan     0.0100   -0.0001
##    820        0.7059             nan     0.0100   -0.0002
##    840        0.6992             nan     0.0100   -0.0004
##    860        0.6926             nan     0.0100   -0.0003
##    880        0.6858             nan     0.0100   -0.0003
##    900        0.6788             nan     0.0100   -0.0003
##    920        0.6724             nan     0.0100   -0.0002
##    940        0.6661             nan     0.0100   -0.0003
##    960        0.6598             nan     0.0100   -0.0002
##    980        0.6541             nan     0.0100   -0.0001
##   1000        0.6478             nan     0.0100   -0.0002
predictions_gbm <- predict(gbm_model, x_full_test, n.trees = 1000, type = "response")

mse_gbm <- mean((y_test - predictions_gbm)^2)

cat("GBM Test MSE:", mse_gbm, "\n")
## GBM Test MSE: 1.122145
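
The model above is fit with `cv.folds = 5`, but predictions use all 1000 trees. As a hedged sketch on simulated data (the objects `dat`, `train`, `test`, `fit`, and `best_iter` are hypothetical, and `n.cores = 1` is set only to keep the example simple), `gbm.perf()` can read the cross-validation curve and return the number of trees with the lowest CV error:

```r
library(gbm)

set.seed(1)
# Simulated stand-in for the training and test data (assumption)
n <- 300; p <- 5
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("x", 1:p)
dat$y <- dat$x1 - 0.5 * dat$x2 + rnorm(n)
train <- dat[1:200, ]
test <- dat[201:300, ]

fit <- gbm(y ~ ., data = train,
           distribution = "gaussian",
           n.trees = 500,
           interaction.depth = 3,
           n.minobsinnode = 10,
           shrinkage = 0.05,
           cv.folds = 5,
           n.cores = 1,
           verbose = FALSE)

# number of trees that minimizes the cross-validated error
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

pred_best <- predict(fit, test, n.trees = best_iter)
pred_all <- predict(fit, test, n.trees = 500)

cat("CV-selected trees:", best_iter, "\n")
cat("Test MSE at CV-selected trees:", mean((test$y - pred_best)^2), "\n")
cat("Test MSE with all trees:", mean((test$y - pred_all)^2), "\n")
```

Whether the CV-selected tree count improves on the full 1000 trees for the study data is not known from the output shown here.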
summary(gbm_model)
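
To compare the models in this section side by side, the test MSE values printed above can be collected into one table. This is only a convenience sketch: the numbers are copied from the output above, the object names are hypothetical, and the group lasso figure is not strictly comparable because it was computed without selecting a single lambda.

```r
# Test MSE values copied from the printed output in this section
test_mse <- c(
  "Elastic net (with covariates)"    = 1.155518,
  "Elastic net (without covariates)" = 1.200805,
  "Group lasso"                      = 1.447646,
  "Random forest"                    = 1.194692,
  "GBM"                              = 1.122145
)

comparison <- data.frame(
  model     = names(test_mse),
  test_mse  = unname(test_mse),
  test_rmse = sqrt(unname(test_mse))
)

# order from lowest to highest test error
comparison <- comparison[order(comparison$test_mse), ]
print(comparison, row.names = FALSE)
```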

Metabolomic Serum Data

First 10 rows and columns of the metabolomic serum data

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
kable(metabol_serum.d[1:10,1:10], align="c", digits=2, format="pipe")
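|          |  430  | 1187  |  940  |  936  |  788  |  698  |  380  |  196  |  114  |  885  |
|:---------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| metab_1  | -2.15 | -0.69 | -0.69 | -0.19 | -1.96 | -1.90 | -0.22 | -1.38 | -0.54 | -1.25 |
| metab_2  | -0.71 | -0.37 | -0.36 | -0.34 | -0.35 | -0.63 | -0.26 | -0.46 | -0.44 | -0.48 |
| metab_3  | 8.60  | 9.15  | 8.95  | 8.54  | 8.73  | 8.24  | 9.03  | 8.29  | 8.37  | 8.18  |
| metab_4  | 0.55  | -1.33 | -0.13 | -0.62 | -0.80 | -0.46 | 0.49  | 0.12  | -0.76 | -0.07 |
| metab_5  | 7.05  | 6.89  | 7.10  | 7.01  | 6.90  | 6.94  | 6.77  | 6.62  | 6.85  | 7.24  |
| metab_6  | 5.79  | 5.81  | 5.86  | 5.95  | 5.95  | 5.42  | 5.82  | 5.65  | 5.44  | 5.60  |
| metab_7  | 3.75  | 4.26  | 4.35  | 4.24  | 4.88  | 4.70  | 4.08  | 4.73  | 3.98  | 4.30  |
| metab_8  | 5.07  | 5.08  | 5.92  | 5.41  | 5.39  | 4.62  | 5.10  | 5.28  | 4.51  | 5.45  |
| metab_9  | -1.87 | -2.30 | -1.97 | -1.89 | -1.55 | -1.78 | -2.29 | -1.64 | -2.02 | -1.68 |
| metab_10 | -2.77 | -3.42 | -3.40 | -2.84 | -2.45 | -3.14 | -3.36 | -2.88 | -3.05 | -2.92 |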